Sanjiv Ranjan Das
2016-12-11
Text expands the universe of data many-fold. See my monograph on text mining in finance at: http://srdas.github.io/Das_TextAnalyticsInFinance.pdf
This covers some of the content of this presentation. These files are useful for the talk itself and you may run the program code as we proceed.
http://srdas.github.io/Temp/user2016/
Wikipedia defines news analytics as “… the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way. News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, ‘bag of words’, among other techniques.”
The R programming language is increasingly being used to download text from the web and then analyze it. The ease with which R may be used to scrape text from a web site may be seen from the following simple command in R:
text = readLines("http://srdas.github.io/bio-candid.html")
text[15:20]
## [1] "being an academic, he worked in the derivatives business in the"
## [2] "Asia-Pacific region as a Vice-President at Citibank. His current"
## [3] "research interests include: the modeling of default risk, machine"
## [4] "learning, social networks, derivatives pricing models, portfolio"
## [5] "theory, and venture capital. He has published over ninety articles in"
## [6] "academic journals, and has won numerous awards for research and"
Here, we downloaded my bio page from my university’s web site. It’s a simple HTML file.
length(text)
## [1] 79
Suppose we just want the 17th line; we do:
text[17]
## [1] "research interests include: the modeling of default risk, machine"
And to find the character length of this line, we use the function:
library(stringr)
## Warning: package 'stringr' was built under R version 3.2.5
str_length(text[17])
## [1] 65
We have first invoked the library stringr, which contains many string-handling functions. In fact, we may also get the length of each line in the text vector by applying the function str_length() to the entire vector.
text_len = str_length(text)
print(text_len)
## [1] 6 69 0 66 70 70 70 63 69 65 68 67 64 67 63 64 65 64 69 63 68 70 39
## [24] 0 0 56 0 65 67 66 65 64 66 69 63 69 65 27 0 3 0 71 71 69 68 71
## [47] 12 0 3 0 71 70 68 71 69 63 67 69 64 67 7 0 3 0 67 71 65 63 72
## [70] 69 68 66 69 70 70 43 0 0 0
print(text_len[55])
## [1] 69
text_len[17]
## [1] 65
Some lines are very long, and these are the ones we are mainly interested in, as they contain the bulk of the story; many of the shorter lines contain HTML formatting instructions instead. Thus, we may sort the lines in decreasing order of length with the following set of commands.
res = sort(text_len,decreasing=TRUE,index.return=TRUE)
idx = res$ix
text2 = text[idx]
text2
## [1] "important to open the academic door to the ivory tower and let the world"
## [2] "Sanjiv is now a Professor of Finance at Santa Clara University. He came"
## [3] "to SCU from Harvard Business School and spent a year at UC Berkeley. In"
## [4] "previous lives into his current existence, which is incredibly confused"
## [5] "Sanjiv's research style is instilled with a distinct \"New York state of"
## [6] "funds, the internet, portfolio choice, banking models, credit risk, and"
## [7] "ocean. The many walks in Greenwich village convinced him that there is"
## [8] "Santa Clara University's Leavey School of Business. He previously held"
## [9] "faculty appointments as Associate Professor at Harvard Business School"
## [10] "and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and"
## [11] "published in May 2010. He currently also serves as a Senior Fellow at"
## [12] "mind\" - it is chaotic, diverse, with minimal method to the madness. He"
## [13] "any time you like, but you can never leave.\" Which is why he is doomed"
## [14] "to a lifetime in Hotel California. And he believes that, if this is as"
## [15] "<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">"
## [16] "Berkeley), an MBA from the Indian Institute of Management, Ahmedabad,"
## [17] "theory, and venture capital. He has published over ninety articles in"
## [18] "science fiction movies, and writing cool software code. When there is"
## [19] "academic papers, which helps him relax. Always the contrarian, Sanjiv"
## [20] "his past life in the unreal world, Sanjiv worked at Citibank, N.A. in"
## [21] "has unpublished articles in many other areas. Some years ago, he took"
## [22] "There he learnt about the fascinating field of Randomized Algorithms,"
## [23] "in. Academia is a real challenge, given that he has to reconcile many"
## [24] "explains, you never really finish your education - \"you can check out"
## [25] "College), and is also a qualified Cost and Works Accountant. He is a"
## [26] "teaching. His recent book \"Derivatives: Principles and Practice\" was"
## [27] "the Asia-Pacific region. He takes great pleasure in merging his many"
## [28] "has published articles on derivatives, term-structure models, mutual"
## [29] "more opinions than ideas. He has been known to have turned down many"
## [30] "senior editor of The Journal of Investment Management, co-editor of"
## [31] "Research, and Associate Editor of other academic journals. Prior to"
## [32] "growing up, Sanjiv moved to New York to change the world, hopefully"
## [33] "confirming that an unchecked hobby can quickly become an obsession."
## [34] "pursuits, many of which stem from being in the epicenter of Silicon"
## [35] "Coastal living did a lot to mold Sanjiv, who needs to live near the"
## [36] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
## [37] "through research. He graduated in 1994 with a Ph.D. from NYU, and"
## [38] "mountains meet the sea, riding sport motorbikes, reading, gadgets,"
## [39] "offers from Mad magazine to publish his academic work. As he often"
## [40] "B.Com in Accounting and Economics (University of Bombay, Sydenham"
## [41] "research interests include: the modeling of default risk, machine"
## [42] "After loafing and working in many parts of Asia, but never really"
## [43] "since then spent five years in Boston, and now lives in San Jose,"
## [44] "thinks that New York City is the most calming place in the world,"
## [45] "no such thing as a representative investor, yet added many unique"
## [46] "The Journal of Derivatives and The Journal of Financial Services"
## [47] "Asia-Pacific region as a Vice-President at Citibank. His current"
## [48] "learning, social networks, derivatives pricing models, portfolio"
## [49] "California. Sanjiv loves animals, places in the world where the"
## [50] "skills he now applies earnestly to his editorial work, and other"
## [51] "Ph.D. from New York University), Computer Science (M.S. from UC"
## [52] "being an academic, he worked in the derivatives business in the"
## [53] "academic journals, and has won numerous awards for research and"
## [54] "time available from the excitement of daily life, Sanjiv writes"
## [55] "time off to get another degree in computer science at Berkeley,"
## [56] "features to his personal utility function. He learnt that it is"
## [57] "<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>"
## [58] "bad as it gets, life is really pretty good."
## [59] "the FDIC Center for Financial Research."
## [60] "after California of course."
## [61] "and diverse."
## [62] "Valley."
## [63] "<HTML>"
## [64] "<p>"
## [65] "<p>"
## [66] "<p>"
## [67] ""
## [68] ""
## [69] ""
## [70] ""
## [71] ""
## [72] ""
## [73] ""
## [74] ""
## [75] ""
## [76] ""
## [77] ""
## [78] ""
## [79] ""
In short, text extraction can be exceedingly simple, though obtaining clean text takes a little more work. Removing HTML tags and other unwanted elements is still a fairly simple operation: the following steps use regular expressions (via grep-style functions) to eliminate HTML formatting characters.
This will generate a single paragraph of text, relatively clean of formatting characters. Such a text collection is also known as a “bag of words”.
text = paste(text,collapse="\n")
print(text)
## [1] "<HTML>\n<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">\n\nSanjiv Das is the William and Janice Terry Professor of Finance at\nSanta Clara University's Leavey School of Business. He previously held\nfaculty appointments as Associate Professor at Harvard Business School\nand UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and\nPh.D. from New York University), Computer Science (M.S. from UC\nBerkeley), an MBA from the Indian Institute of Management, Ahmedabad,\nB.Com in Accounting and Economics (University of Bombay, Sydenham\nCollege), and is also a qualified Cost and Works Accountant. He is a\nsenior editor of The Journal of Investment Management, co-editor of\nThe Journal of Derivatives and The Journal of Financial Services\nResearch, and Associate Editor of other academic journals. Prior to\nbeing an academic, he worked in the derivatives business in the\nAsia-Pacific region as a Vice-President at Citibank. His current\nresearch interests include: the modeling of default risk, machine\nlearning, social networks, derivatives pricing models, portfolio\ntheory, and venture capital. He has published over ninety articles in\nacademic journals, and has won numerous awards for research and\nteaching. His recent book \"Derivatives: Principles and Practice\" was\npublished in May 2010. He currently also serves as a Senior Fellow at\nthe FDIC Center for Financial Research.\n\n\n<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>\n\nAfter loafing and working in many parts of Asia, but never really\ngrowing up, Sanjiv moved to New York to change the world, hopefully\nthrough research. He graduated in 1994 with a Ph.D. from NYU, and\nsince then spent five years in Boston, and now lives in San Jose,\nCalifornia. Sanjiv loves animals, places in the world where the\nmountains meet the sea, riding sport motorbikes, reading, gadgets,\nscience fiction movies, and writing cool software code. 
When there is\ntime available from the excitement of daily life, Sanjiv writes\nacademic papers, which helps him relax. Always the contrarian, Sanjiv\nthinks that New York City is the most calming place in the world,\nafter California of course.\n\n<p>\n\nSanjiv is now a Professor of Finance at Santa Clara University. He came\nto SCU from Harvard Business School and spent a year at UC Berkeley. In\nhis past life in the unreal world, Sanjiv worked at Citibank, N.A. in\nthe Asia-Pacific region. He takes great pleasure in merging his many\nprevious lives into his current existence, which is incredibly confused\nand diverse.\n\n<p>\n\nSanjiv's research style is instilled with a distinct \"New York state of\nmind\" - it is chaotic, diverse, with minimal method to the madness. He\nhas published articles on derivatives, term-structure models, mutual\nfunds, the internet, portfolio choice, banking models, credit risk, and\nhas unpublished articles in many other areas. Some years ago, he took\ntime off to get another degree in computer science at Berkeley,\nconfirming that an unchecked hobby can quickly become an obsession.\nThere he learnt about the fascinating field of Randomized Algorithms,\nskills he now applies earnestly to his editorial work, and other\npursuits, many of which stem from being in the epicenter of Silicon\nValley.\n\n<p>\n\nCoastal living did a lot to mold Sanjiv, who needs to live near the\nocean. The many walks in Greenwich village convinced him that there is\nno such thing as a representative investor, yet added many unique\nfeatures to his personal utility function. He learnt that it is\nimportant to open the academic door to the ivory tower and let the world\nin. Academia is a real challenge, given that he has to reconcile many\nmore opinions than ideas. He has been known to have turned down many\noffers from Mad magazine to publish his academic work. 
As he often\nexplains, you never really finish your education - \"you can check out\nany time you like, but you can never leave.\" Which is why he is doomed\nto a lifetime in Hotel California. And he believes that, if this is as\nbad as it gets, life is really pretty good.\n\n\n"
text = str_replace_all(text,"[<>{}()&;,.\n]"," ")
print(text)
## [1] " HTML BODY background=\"http://algo scu edu/~sanjivdas/graphics/back2 gif\" Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara University's Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley He holds post-graduate degrees in Finance M Phil and Ph D from New York University Computer Science M S from UC Berkeley an MBA from the Indian Institute of Management Ahmedabad B Com in Accounting and Economics University of Bombay Sydenham College and is also a qualified Cost and Works Accountant He is a senior editor of The Journal of Investment Management co-editor of The Journal of Derivatives and The Journal of Financial Services Research and Associate Editor of other academic journals Prior to being an academic he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank His current research interests include: the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital He has published over ninety articles in academic journals and has won numerous awards for research and teaching His recent book \"Derivatives: Principles and Practice\" was published in May 2010 He currently also serves as a Senior Fellow at the FDIC Center for Financial Research p B Sanjiv Das: A Short Academic Life History /B p After loafing and working in many parts of Asia but never really growing up Sanjiv moved to New York to change the world hopefully through research He graduated in 1994 with a Ph D from NYU and since then spent five years in Boston and now lives in San Jose California Sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code When there is time available from the excitement of daily life Sanjiv writes academic papers which helps him relax Always 
the contrarian Sanjiv thinks that New York City is the most calming place in the world after California of course p Sanjiv is now a Professor of Finance at Santa Clara University He came to SCU from Harvard Business School and spent a year at UC Berkeley In his past life in the unreal world Sanjiv worked at Citibank N A in the Asia-Pacific region He takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse p Sanjiv's research style is instilled with a distinct \"New York state of mind\" - it is chaotic diverse with minimal method to the madness He has published articles on derivatives term-structure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas Some years ago he took time off to get another degree in computer science at Berkeley confirming that an unchecked hobby can quickly become an obsession There he learnt about the fascinating field of Randomized Algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of Silicon Valley p Coastal living did a lot to mold Sanjiv who needs to live near the ocean The many walks in Greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function He learnt that it is important to open the academic door to the ivory tower and let the world in Academia is a real challenge given that he has to reconcile many more opinions than ideas He has been known to have turned down many offers from Mad magazine to publish his academic work As he often explains you never really finish your education - \"you can check out any time you like but you can never leave \" Which is why he is doomed to a lifetime in Hotel California And he believes that if this is as bad as it gets life is really pretty good "
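The same kind of cleanup can also be sketched in base R alone, without stringr. The snippet below is a minimal, self-contained illustration on a made-up inline HTML fragment (the `html` string and variable names are hypothetical, not taken from the bio page):

```r
# Hypothetical fragment standing in for raw downloaded HTML
html <- "<p>He has published <B>ninety</B> articles; see &amp; read.</p>"

clean <- gsub("<[^>]*>", " ", html)        # drop HTML tags
clean <- gsub("&[a-z]+;", " ", clean)      # drop HTML entities such as &amp;
clean <- gsub("[[:punct:]]", " ", clean)   # drop punctuation
clean <- gsub("\\s+", " ", trimws(clean))  # collapse runs of whitespace
print(clean)
## [1] "He has published ninety articles see read"
```

The order of operations matters: stripping punctuation before tags would destroy the < and > delimiters and leave tag names embedded in the text.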
The XML package in R also comes with many functions that aid in parsing a document and dropping its content (mostly unformatted) into a flat file or data frame, which may then be processed further. The following example, adapted from r-bloggers.com, uses this URL:
http://www.w3schools.com/xml/plant_catalog.xml
library(XML)
## Warning: package 'XML' was built under R version 3.2.4
#Part 1: Reading an XML file and creating a data frame from it.
xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
xmlfile <- xmlTreeParse(xml.url)
xmltop <- xmlRoot(xmlfile)
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
plantcat_df <- data.frame(t(plantcat),row.names=NULL)
plantcat_df[1:5,1:4]
## COMMON BOTANICAL ZONE LIGHT
## 1 Bloodroot Sanguinaria canadensis 4 Mostly Shady
## 2 Columbine Aquilegia canadensis 3 Mostly Shady
## 3 Marsh Marigold Caltha palustris 4 Mostly Sunny
## 4 Cowslip Caltha palustris 4 Mostly Shady
## 5 Dutchman's-Breeches Dicentra cucullaria 3 Mostly Shady
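For simple, regular XML like this plant catalog, the XML package also offers xmlToDataFrame(), which can often replace the nested xmlSApply() calls above with a single step. Here is a minimal sketch on an inline XML string (the two-plant string below is made up for illustration):

```r
library(XML)

# Made-up miniature version of the plant catalog
xml_string <- "<CATALOG><PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT><PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT></CATALOG>"

# Each child node of the root becomes one row of the data frame
df <- xmlToDataFrame(xmlParse(xml_string, asText = TRUE))
print(df)
```

This shortcut works when every child node has the same flat set of fields; for irregular documents the explicit xmlSApply() approach gives finer control.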
#Example adapted from https://stat.ethz.ch/pipermail/r-help/2008-September/175364.html
#Load the iris data set and create a data frame
data("iris")
data <- as.data.frame(iris)
xml <- xmlTree()
xml$addTag("document", close=FALSE)
## Warning in xmlRoot.XMLInternalDocument(currentNodes[[1]]): empty XML
## document
for (i in 1:nrow(data)) {
xml$addTag("row", close=FALSE)
for (j in names(data)) {
xml$addTag(j, data[i, j])
}
xml$closeTag()
}
xml$closeTag()
#view the xml (uncomment line below to see XML, long output)
#cat(saveXML(xml))

First, let’s read in a simple web page (my landing page):
text = readLines("http://srdas.github.io/")
print(text[1:4])
## [1] "<html>"
## [2] ""
## [3] "<head>"
## [4] "<title>SCU Web Page of Sanjiv Ranjan Das</title>"
print(length(text))
## [1] 36
String handling is a basic need, so we use the stringr package.
#EXTRACTING SUBSTRINGS (take some time to look at
#the "stringr" package also)
library(stringr)
substr(text[4],24,29)
## [1] "Sanjiv"
#IF YOU WANT TO LOCATE A STRING
res = regexpr("Sanjiv",text[4])
print(res)
## [1] 24
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
print(substr(text[4],res[1],res[1]+nchar("Sanjiv")-1))
## [1] "Sanjiv"
#ANOTHER WAY
res = str_locate(text[4],"Sanjiv")
print(res)
## start end
## [1,] 24 29
print(substr(text[4],res[1],res[2]))
## [1] "Sanjiv"
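The locate-then-substring pattern above can also be collapsed into a single step in base R: regmatches() returns exactly the substring matched by regexpr(). A small sketch, using an inline copy of the title line so it runs stand-alone:

```r
# Inline copy of text[4] so the example is self-contained
line <- "<title>SCU Web Page of Sanjiv Ranjan Das</title>"
regmatches(line, regexpr("Sanjiv", line))
## [1] "Sanjiv"
```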
Now we look at using regular expressions with grep to clean up text. We read in my research page and undertake a “ruthless” cleanup.
#SIMPLE TEXT HANDLING
text = readLines("http://srdas.github.io/research.htm")
print(length(text))
## [1] 823
print(text)
## [1] "<HTML>"
## [2] "<HEAD>"
## [3] "<TITLE>Research of Professor Sanjiv Ranjan Das</TITLE>"
## [4] "<BASE HREF=\"http://srdas.github.io/\">"
## [5] "</HEAD>"
## [6] "<BODY background=\"http://srdas.github.io/graphics/back2.gif\">"
## [7] ""
## [8] "<H2>BOOKS and MONOGRAPHS</H2>"
## [9] ""
## [10] "<OL reversed>"
## [11] ""
## [12] "<LI><img src=\"graphics/DSTMAA.png\" width=\"50\" height=\"65\">"
## [13] "\"Data Science: Theories, Models, Algorithms, and Analytics\" (web book -- work in progress)"
## [14] "<a href=\"http://srdas.github.io/Papers/DSA_Book.pdf\">Read here.</a>"
## [15] ""
## [16] ""
## [17] "<LI><img src=\"graphics/derbook_cover.png\" width=\"50\" height=\"65\">"
## [18] "\"Derivatives: Principles and Practice\" (2010),"
## [19] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."
## [20] "<a href=\"http://www.amazon.com/Derivatives-Rangarajan-Sundaram/dp/0072949317/ref=sr_1_1?ie=UTF8&s=books&qid=1268798971&sr=8-1\">[Amazon]</a>"
## [21] "<a href=\"http://productsearch.barnesandnoble.com/search/results.aspx?WRD=sundaram+das\">[BarnesNoble]</a>"
## [22] ""
## [23] "</OL>"
## [24] ""
## [25] "<H2>REFEREED JOURNAL PUBLICATIONS</H2>"
## [26] ""
## [27] "<OL reversed>"
## [28] ""
## [29] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"
## [30] "\"An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."
## [31] "Forthcoming, <I>Journal of Banking and Finance</I>."
## [32] "<br>[<I> [Develops a new measure of liquidity for all sectors of the markets using ETFs. "
## [33] "RFinance Best Paper Award, May 2016. This paper won the S&P SPIVA 2012 Award for innovation of an index.</I>]"
## [34] "<a href=\"Papers/etfliq.pdf\">[PDF]</a>"
## [35] "</LI>"
## [36] ""
## [37] "<LI><img src=\"graphics/JAI.png\" width=\"55\" height=\"40\">"
## [38] "\"Matrix Metrics: Network-Based Systemic Risk Scoring\", (2016)."
## [39] "<I>Journal of Alternative Investments</I>, Special Issue on Systemic Risk, v18(4), 33-51."
## [40] "<br>[<I>A new approach to identifying system-wide financial risk, SIFIs, and several other measures"
## [41] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "
## [42] "the best paper on SIFIs (systemically important financial institutions). "
## [43] "It also won the best paper award at "
## [44] "the R Finance conference, Chicago 2015. </I>]"
## [45] "<a href=\"Papers/JAI_Das_issue.pdf\">[PDF of paper]</a>"
## [46] "<a href=\"Papers/JAI_EditorsLetter_issue.pdf\">[Editor's letter re Special Issue]</a>"
## [47] "<a href=\"Papers/JAI_Getmansky_Stein_issue.pdf\">[Editor's overview]</a>"
## [48] "<a href=\"Papers/RiskNetworks_slides_RFinance_2015_05.pdf\">[SLIDES RFinance]</a>. "
## [49] "</LI>"
## [50] ""
## [51] ""
## [52] ""
## [53] ""
## [54] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"
## [55] "\"Credit Spreads with Dynamic Debt\" (with Seoyoung Kim), (2015), "
## [56] "<I>Journal of Banking and Finance</I>, v50, 121-140."
## [57] "<a href=\"Papers/DasKim_JBF2015_FINAL.pdf\">[PDF]</a>"
## [58] "<br>[<I>Extends the Merton risky debt model from static debt to dynamic debt"
## [59] "and generates credit spread term structures that are closer to those in the data</I>]"
## [60] "</LI>"
## [61] ""
## [62] "<LI><img src=\"graphics/FTF.jpg\" width=\"40\" height=\"55\">"
## [63] "\"Text and Context: Language Analytics for Finance\", (2014),"
## [64] "<I>Foundations and Trends in Finance</I>, v8(3), 145-260. "
## [65] "<a href=\"Papers/Das_TextAnalyticsInFinance.pdf\">[PDF]</a>"
## [66] "<br>[<I>A comprehensive survey of comcepts, tools, techniques, and empirical "
## [67] "literature on textual processing in finance.</I>]"
## [68] ""
## [69] ""
## [70] "<LI><img src=\"graphics/jfe.gif\" width=\"40\" height=\"55\">\"Did CDS Trading Improve the Market for Corporate Bonds?\" (with Madhu Kalimipalli and Subhankar Nayak), (2014), <I>Journal of Financial Economics</I> 111, 495-525."
## [71] "<a href=\"Papers/cdsbondeff.pdf\">[PDF]</a>"
## [72] "<br>[<I>The inception of CDS trading in a reference name renders its bonds less efficient, with no improvement in market quality or liquidity</I>]"
## [73] "</LI>"
## [74] ""
## [75] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"
## [76] "\"Strategic Loan Modification: An Options-Based Response to Strategic Default,\""
## [77] "(with Ray Meadows), (2013), <I>Journal of Banking and Finance</I> 37, 636-647. "
## [78] "<a href=\"Papers/sam.pdf\">[PDF]</a>"
## [79] "<br>[<I>A closed-form solution for mortgage debt with default and optimal loan modificatoin thereon.</I>]"
## [80] "</LI>"
## [81] ""
## [82] ""
## [83] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"
## [84] "\"Options and Structured Products in Behavioral Portfolios,\" (with Meir Statman), (2013), "
## [85] "<I>Journal of Economic Dynamics and Control</I>, 37(1), 137-153."
## [86] "<a href=\"Papers/JEDC_FINAL_PROOF.pdf\">[PDF]</a>"
## [87] "<br>[<I>Explores the roles in behavioral portfolios of option collars, capital guaranteed notes, "
## [88] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."
## [89] "</I>]"
## [90] "</LI>"
## [91] ""
## [92] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\"> "
## [93] "\"The Principal Principle,\" (2012), <I>Journal of Financial and QuantitativeAnalysis</I>, 47(6), 1215-1246. "
## [94] "<a href=\"http://journals.cambridge.org/repo_A884JKBk\">[PDF]</a>"
## [95] "<br>[<I>Optimal approaches for mortgage loan modification. Principal reduction is optimal, and better than rate reductions, maturity extensions, and principal forebearance. Shared-appreciation mortgages solve moral hazard.</I>]"
## [96] "</LI>"
## [97] ""
## [98] "<LI><img src=\"graphics/IEEE.gif\" width=\"40\" height=\"55\"> "
## [99] "\"Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study,\" (2011), (with Douglas Burdick, Mauricio A. Hernandez, Howard Ho, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ioana Stanoi, Shivakumar Vaithyanathan), <I>IEEE Data Engineering Bulletin</I>, 34(3), 60-67."
## [100] "<a href=\"Papers/midaswww2011_FINAL.pdf\">[PDF older version]</a>"
## [101] "<a href=\"Papers/midas-deb_July2011.pdf\">[PDF final version]</a>"
## [102] ""
## [103] "<LI><img src=\"graphics/jfint_cover.gif\" width=\"40\" height=\"55\"> "
## [104] "\"Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance,\" (2011), (with Hoje Jo and Yongtae Kim), "
## [105] "<I>Journal of Financial Intermediation</I> 20(2), 199--230."
## [106] "<a href=\"Papers/synd.pdf\">[PDF]</a>"
## [107] "<br>[<I>Syndicate-financed firms fare better---higher return multiples come from better selection, but time-to-exit and likelihood of exit are better on accont of superior monitoring by the syndicate.</I>]"
## [108] "</LI>"
## [109] ""
## [110] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\"> \"Portfolio"
## [111] "Optimization with Mental Accounts,\" (2010), (with Harry Markowitz, Jonathan"
## [112] "Scheid, and Meir Statman), <I>Journal of Financial and Quantitative"
## [113] "Analysis</I>, v45(2), 311-334."
## [114] "<a href=\"http://journals.cambridge.org/repo_A772rEdS\">[PDF (copyright: Cambridge University Press)]</a>"
## [115] "<br>[<I>Mean-variance optimization is reconciled with behavioral porfolio theory. Mental "
## [116] "accounts optimization leads to better aggregate portfolios.</I>]"
## [117] "</LI>"
## [118] ""
## [119] "<LI><img src=\"graphics/jcr.gif\" width=\"40\" height=\"55\">"
## [120] "\"The Long and Short of it: Why are stocks with shorter run-lengths preferred?\" (2010), (with Priya Raghubir), <I>Journal of Consumer Research</I>. 36(6), 964-982."
## [121] "<a href=\"Papers/runlength.pdf\">[PDF]</a>, "
## [122] "<a href=\"Papers/runlength_summary.pdf\">[Non-technical summary]</a>"
## [123] "<br>[<I>People responding to stock charts are systematically biased against stocks with longer run lengths, even if these stocks are no riskier than those with shorter runs.</I>]"
## [124] "</LI>"
## [125] ""
## [126] ""
## [127] "<LI><img src=\"graphics/anor.jpg\" width=\"40\" height=\"55\">"
## [128] "\"Run Lengths and Liquidity,\" (with Paul Hanouna), (2010), <I>Annals of Operations Resarch</I>, Special Issue on Risk and Uncertainty, 176(1), 127-152."
## [129] "<a href=\"Papers/rs.pdf\">[PDF]</a>"
## [130] "<br>[<I>The run signature of a stock is shown to be mathematically related to liquidity. Runs are "
## [131] "priced factors. </I>]"
## [132] "</LI>"
## [133] ""
## [134] ""
## [135] ""
## [136] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"
## [137] "\"Implied Recovery,'' (with Paul Hanouna), (2009), <I>Journal of Economic Dynamics and Control</I>, 33(11), 1837-1857."
## [138] "<a href=\"Papers/imprec.pdf\">[PDF]</a>"
## [139] "<br>[<I>How to use the term structure of CDS spreads to jointly identify the term structures of forward default probability and recovery rates. </I>]"
## [140] "</LI>"
## [141] ""
## [142] ""
## [143] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"
## [144] "\"Accounting-based versus market-based cross-sectional models of CDS spreads,\" "
## [145] "(with Paul Hanouna and Atulya Sarin), (2009), "
## [146] "<I>Journal of Banking and Finance</I>, 33, 719-730. "
## [147] "<a href=\"Papers/JBF_final_3.pdf\">[PDF]</a>"
## [148] "<br>[<I>Accounting models explain spreads as well as market-based ones, but a hybrid mix does best.</I>]"
## [149] "</LI>"
## [150] ""
## [151] ""
## [152] "<LI><img src=\"graphics/jfint_cover.gif\" width=\"40\" height=\"55\"> "
## [153] "\"Hedging Credit: Equity Liquidity Matters,\" (with Paul Hanouna), (2009),"
## [154] "<I>Journal of Financial Intermediation</I>, v18(1), 112-123"
## [155] "<a href=\"Papers/cdsliq.pdf\">[PDF]</a>"
## [156] "<br>[<I>Hedging in CDS markets provides a mechanism by which equity market liquidity impacts CDS spreads </I>]"
## [157] "</LI>"
## [158] ""
## [159] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"
## [160] "\"An Integrated Model for Hybrid Securities,\""
## [161] "(with Raghu Sundaram), (2007), <I>Management Science</I>, v53, 1439-1451."
## [162] "<a href=\"Papers/rsx_FINAL.pdf\">[PDF]</a>"
## [163] "<br>[<I>A general flexible model for pricing derivative securities that depend on equity, "
## [164] "interest rate and credit risk, using observables. Delivers dynamic implied default probabilities.</I>]"
## [165] "</LI>"
## [166] ""
## [167] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"
## [168] "\"Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,\""
## [169] "(with Mike Chen), (2007), <I>Management Science</I>, v53, 1375-1388."
## [170] "<a href=\"Papers/chat_FINAL.pdf\">[PDF]</a>"
## [171] "<br>[<I>A methodology for parsing internet stock chat to develop a sentiment index. Assesses"
## [172] "whether small traders opinions contain information not in prices. </I>]"
## [173] "</LI>"
## [174] ""
## [175] "<LI><img src=\"graphics/JF_cover.jpg\" width=\"120\" height=\"55\">"
## [176] "\"Common Failings: How Corporate Defaults are Correlated\" "
## [177] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."
## [178] "(2007) <I>Journal of Finance</I>, v62, 93-117. "
## [179] "<a href=\"Papers/ddks.pdf\">[PDF]</a>"
## [180] "<br>[<I>New approach to test for defaul contagion using a stochastic time change. "
## [181] "Doubly stochastic models are refuted by the data.</I>]"
## [182] "</LI>"
## [183] ""
## [184] "<LI><img src=\"graphics/fmalogo_main.gif\" width=\"40\" height=\"55\">"
## [185] "\"A Clinical Study of Investor Discussion and Sentiment,\" "
## [186] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "
## [187] "<I>Financial Management</I>, v34(5), 103-137."
## [188] "<a href=\"Papers/einfo.pdf\">[PDF]</a>"
## [189] "<br>[<I>Examines the interaction of chat room information and news. </I>]"
## [190] "</LI>"
## [191] ""
## [192] "<LI><img src=\"graphics/JF_cover.jpg\" width=\"120\" height=\"55\">"
## [193] "\"International Portfolio Choice with Systemic Risk,\""
## [194] "(with Raman Uppal), 2004, <I>Journal of Finance</I>, v59(6), 2809-2834."
## [195] "<a href=\"Papers/systemic.pdf\">[PDF]</a>"
## [196] "<br>[<I>A model for portfolio optimization with systemic risk. "
## [197] "The loss resulting from diminished diversification is small, while"
## [198] "that from holding very highly levered positions is large. </I>]"
## [199] "</LI>"
## [200] ""
## [201] "<LI><img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\"> \"Fee"
## [202] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"
## [203] "Investor Welfare,'' (with Rangarajan Sundaram), 2002, <i>Review of"
## [204] "Financial Studies</i>, v15, 1465-1497."
## [205] "<a href=\"Papers/fees.pdf\">[PDF]</a>"
## [206] "<br><I>[Compares fulcrum vs incentive fees structures from the standpoint of "
## [207] "investor welfare. Contrary to regulatory intuition, incentive structures"
## [208] "are often optimal.] </I>"
## [209] "</LI>"
## [210] ""
## [211] "<LI><img src=\"graphics/FAJ_cover.gif\" width=\"140\" height=\"55\">"
## [212] "\"A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"
## [213] "with Rating Transitions,\" (with Viral Acharya and Rangarajan Sundaram),"
## [214] "2002, <I>Financial Analysts Journal</I>, May-June, 28-44."
## [215] "<a href=\"Papers/dsmarkov.pdf\">[PDF]</a>"
## [216] "<br><I>[A HJM type two-factor model in risk free rates and spreads that also accounts "
## [217] "for rating transitions, allowing seamless pricing of many credit derivatives. ] </I>"
## [218] "</LI>"
## [219] ""
## [220] "<LI><img src=\"graphics/JOE_cover.gif\" width=\"40\" height=\"55\">"
## [221] "\"The Surprise Element: Jumps in Interest Rates\", 2002, <I>Journal of"
## [222] "Econometrics</I>, v106, 27-65."
## [223] "<a href=\"Papers/jump.pdf\">[PDF]</a>"
## [224] "<br><I>[Estimation methodology for interest rates with jumps. A flexible "
## [225] "specification that accommodates Federal Reserve Activity.]</I>"
## [226] "</LI>"
## [227] ""
## [228] "<LI><img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\">"
## [229] "\"Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"
## [230] " 2002, <I>Review of Financial Studies</I>, v15(1), 195-241."
## [231] "<a href=\"Papers/affine.pdf\">[PDF]</a>"
## [232] "<br><I>[General affine option pricing for interest rate derivatives covering a "
## [233] "wide range of securities, allowing for M factors with N diffusions and L jumps.] </I>"
## [234] "</LI>"
## [235] ""
## [236] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"
## [237] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "
## [238] "(with Rangarajan Sundaram), 2000, <I>Management Science</I>, v46(1), 46-62."
## [239] "<a href=\"msfinal.ps\">[PS]</a>"
## [240] "<br><I>[HJM style two factor model for credit risk. ] </I>"
## [241] "</LI>"
## [242] ""
## [243] "<LI><img src=\"graphics/FAJ_cover.gif\" width=\"140\" height=\"55\">"
## [244] "\"The Psychology of Financial Decision Making: A Case"
## [245] "for Theory-Driven Experimental Enquiry,''"
## [246] "1999, (with Priya Raghubir),"
## [247] "<I>Financial Analyst's Journal</I>, Nov-Dec 1999, v55(6), 56-79."
## [248] "<br><I>[Surveys the anomalies literature in Finance and shows how experimental"
## [249] "studies may be used to disentangle competing hypotheses for the same anomaly.]</I>"
## [250] "</LI>"
## [251] ""
## [252] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\">"
## [253] "\"Of Smiles and Smirks: A Term Structure Perspective,''"
## [254] "1999, (with Rangarajan Sundaram), <I>Journal of"
## [255] "Financial and Quantitative Analysis</I>, v34(2), 211-240."
## [256] "<a href=\"Papers/skew.pdf\">[PDF]</a>"
## [257] "<br><I>[Explains how the shape of the volatility smile is determined by "
## [258] "jumps and stochastic volatility. ]</I>"
## [259] "</LI>"
## [260] ""
## [261] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"
## [262] "\"A Theory of Banking Structure,\" 1999, (with Ashish Nanda),"
## [263] "<I>Journal of Banking and Finance</I>, v23(6), 863-895."
## [264] "<br><I>[A theory to analyze the specialization of banking activities based "
## [265] "by function based upon two dimensions: the degree of information asymmetry "
## [266] "and the degree of verifiability of the value of the service rendered. ]</I>"
## [267] "</LI>"
## [268] ""
## [269] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"
## [270] "\"A Theory of Optimal Timing and Selectivity,'' "
## [271] "(with George Chacko), 1999, <I>Journal of"
## [272] "Economic Dynamics and Control</I>, v23(7), 929-966."
## [273] "<br><I>[Dynamic optimal portfolio choice model for determining optimal effort"
## [274] "allocation to timing and stock selection in asset allocation.]</I>"
## [275] "</LI>"
## [276] ""
## [277] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"
## [278] "\"A Direct Discrete-Time Approach to"
## [279] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "
## [280] "Model,\" 1999, <I>Journal of Economic Dynamics and Control</I>, v23(3), 333-369."
## [281] "<br><I>[HJM tree with jumps. Fast, fully recombining dynamics. ] </I>"
## [282] "</LI>"
## [283] ""
## [284] "<LI><img src=\"graphics/RESTAT_cover.jpg\" width=\"40\" height=\"55\">"
## [285] "\"The Central Tendency: A Second Factor in"
## [286] "Bond Yields,\" 1998, (with Silverio Foresi and Pierluigi Balduzzi), "
## [287] "<I>The Review of Economics and Statistics</I>, v80(1), 60-72."
## [288] "<br><I>[Model of the term structure with stochastic long-run mean. Related to "
## [289] "Federal Reserve acitivity.]</I>"
## [290] "<a href=\"Papers/BalduzziDasForesi_ReStat1998_CentralTendency.pdf\">[PDF]</a>"
## [291] "</LI>"
## [292] ""
## [293] "<LI> <img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\">"
## [294] "\"Efficiency with Costly Information: A Reinterpretation of"
## [295] "Evidence from Managed Portfolios,\" (with Edwin Elton, Martin Gruber and Matt "
## [296] "Hlavka), <I>Review of Financial Studies</I>, vol. 6(1), 1993, pp 1-22. "
## [297] "<a href=\"Papers/EGDH.pdf\">[PDF]</a>"
## [298] "<br><I>[Mutual funds are not informationally efficient. "
## [299] "You are better off buying the index.] </I>"
## [300] "<br>"
## [301] "Presented and Reprinted in the Proceedings of The "
## [302] "Seminar on the Analysis of Security Prices at the Center "
## [303] "for Research in Security Prices at the University of "
## [304] "Chicago, Graduate School of Business. </LI>"
## [305] ""
## [306] ""
## [307] ""
## [308] ""
## [309] ""
## [310] "<H2>MORE REFEREED JOURNAL PUBLICATIONS</H2>"
## [311] ""
## [312] ""
## [313] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "
## [314] "\"Managing Rollover Risk with Capital Structure Covenants"
## [315] "in Structured Finance Vehicles\" (2016),"
## [316] "(with Seoyoung Kim), forthcoming <I>Journal of Fixed Income</I>."
## [317] "<a href=\"Papers/siv_JFI.pdf\">[PDF]</a>"
## [318] "<br><I>[We propose a covenant-based capital structure that mitigates rollover problems in SIVs and is Pareto-improving for equity and debt holders in the SPV.]</I>"
## [319] "</LI>"
## [320] ""
## [321] ""
## [322] ""
## [323] "<LI><img src=\"graphics/JRFM.png\" width=\"40\" height=\"55\"> "
## [324] "\"The Design and Risk Management of Structured Finance Vehicles\" (2016),"
## [325] "(with Seoyoung Kim), forthcoming, <I>Journal of Risk and Financial Management</I>, Special Issue on Credit Risk."
## [326] "<a href=\"Papers/siv_JRFM.pdf\">[PDF]</a>"
## [327] "<br><I>[Risk management for special investment vehicles is difficult, but necessary. "
## [328] "Post the recent subprime financial crisis, we inform the creation of safer SIVs "
## [329] "in structured finance, and propose avenues of mitigating risks faced by senior debt through "
## [330] "deleveraging policies in the form of leverage risk controls and contingent capital.]</I>"
## [331] "</LI>"
## [332] ""
## [333] ""
## [334] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"
## [335] "\"Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework\" (2014), "
## [336] "(with Seoyoung Kim and Meir Statman), "
## [337] "<I>Journal of Portfolio Management</I>, 41(1), 95-108."
## [338] "<a href=\"Papers/underfunded.pdf\">[PDF]</a>"
## [339] "<br><I>[Provides a new definition of underfunded portfolios, and compares four remedies for underfunding.]</I>"
## [340] "</LI>"
## [341] ""
## [342] ""
## [343] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "
## [344] "\"Going for Broke: Restructuring Distressed Debt Portfolios\" (2014),"
## [345] "(with Seoyoung Kim), <I>Journal of Fixed Income</I>, 24(3), 5-27."
## [346] "<a href=\"Papers/ddo.pdf\">[PDF]</a>"
## [347] "<br><I>[Optimizing portfolios where the return distributions of the assets is endogenous. The gains from restructuring distressed debt portfolios are large.]</I>"
## [348] "</LI>"
## [349] ""
## [350] ""
## [351] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"
## [352] "\"Digital Portfolios.\" (2013), "
## [353] "<I>Journal of Portfolio Management</I>, v39(2), 41-48."
## [354] "<a href=\"Papers/vport.pdf\">[PDF]</a>"
## [355] "<br><I>[Constructing portfolios of assets with a binary payoff, large versus zero, and the differences in this optimization versus standard mean-variance portfolio construction.]</I>"
## [356] "</LI>"
## [357] ""
## [358] ""
## [359] "<LI><img src=\"graphics/frl.jpg\" width=\"40\" height=\"55\">"
## [360] "\"Options on Portfolios with Higher-Order Moments,\" (2009),"
## [361] "(with Rishabh Bhandari), <I>Finance Research Letters</I>, v6, 122-129. "
## [362] "<a href=\"Papers/tensor.pdf\">[PDF]</a>"
## [363] "<br><I>[How to model fat-tailed portfolio distributions for "
## [364] "options on a multivariate system of assets, calibrated to the return "
## [365] "means, covariance matrix, coskewness and cokurtosis tensors.]</I>"
## [366] "</LI>"
## [367] ""
## [368] ""
## [369] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [370] "\"Dealing with Dimension: Option Pricing on Factor Trees,\" (2009),"
## [371] "(with Brian Granger), <I>Journal of Investment Management</I>, 7(2), 73-85."
## [372] "<a href=\"Papers/faclat.pdf\">[PDF]</a>"
## [373] "<br><I>[Multifactor representations of securities on high-dimensional trees. Allows "
## [374] "you to price options on multiple assets in a unified fraamework. Computational"
## [375] "results assess using multithreading.]</I>"
## [376] "</LI>"
## [377] ""
## [378] ""
## [379] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "
## [380] "\"Modeling"
## [381] "Correlated Default with a Forest of Binomial Trees,\" (2007), (with"
## [382] "Santhosh Bandreddi and Rong Fan), <I>Journal of Fixed"
## [383] "Income</I>. Winter, 1-20."
## [384] "<a href=\"Papers/bscorrdef.pdf\">[PDF]</a>"
## [385] "<br><I>[Extends the Das-Sundaram hybrid securities model to correlated default modeling. ]</I>"
## [386] "</LI>"
## [387] ""
## [388] "<LI><img src=\"graphics/jfsr_cover.jpg\" width=\"40\" height=\"55\">"
## [389] "\"Basel II: Correlation Related Issues\" (2007), "
## [390] "<I>Journal of Financial Services Research</I>, v32, 17-38."
## [391] "<a href=\"Papers/Das_JFSR2007_Basel2.pdf\">[PDF]</a>"
## [392] "<br><I>[Analysis of correlation related issues arising in the implementation"
## [393] "of the Basel II accord.]</I>"
## [394] "</LI>"
## [395] ""
## [396] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"
## [397] "\"Correlated Default Risk,\" (2006),"
## [398] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"
## [399] "<I>Journal of Fixed Income</I>, Fall 2006, 7-32."
## [400] "<a href=\"Papers/DasFreedGengKapadia_JFI2006.pdf\">[PDF]</a>"
## [401] "<br><I>[Empirical evidence on the nature of credit correlations. Correlations"
## [402] "increase as markets worsen. Regime switching models are needed to explain dynamic"
## [403] "correlations.]</I>"
## [404] "</LI>"
## [405] ""
## [406] "<LI><img src=\"graphics/qfcover.gif\" width=\"40\" height=\"55\">"
## [407] "\"A Simple Model for Pricing Equity Options with Markov"
## [408] "Switching State Variables\" (2006),"
## [409] "(with Donald Aingworth and Rajeev Motwani),"
## [410] "<I>Quantitative Finance</I>, v6(2), 95-105."
## [411] "<a href=\"Papers/switch.pdf\">[PDF]</a>"
## [412] "<br><I>[A tree model for options when the underlying has regime switches.]</I>"
## [413] "</LI>"
## [414] ""
## [415] "<LI><img src=\"graphics/mktletters.gif\" width=\"40\" height=\"55\">"
## [416] "\"The Firm's Management of Social Interactions,\" (2005)"
## [417] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "
## [418] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "
## [419] "<I>Marketing Letters</I>, v16, 415-428.Ê"
## [420] "<br><I>[A framework for how word-of-mouth communication is modeled in "
## [421] "the practice of marketing. ]</I>"
## [422] "</LI>"
## [423] ""
## [424] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"
## [425] "\"Financial Communities\" (with Jacob Sisk), 2005, "
## [426] "<i>Journal of Portfolio Management</i>, v31(4), "
## [427] "Summer, 112-123."
## [428] "<a href=\"Papers/fincom.pdf\">[PDF]</a>"
## [429] "<br><I>[Applying graph theory to understanding investor networks to "
## [430] "develop trading rules. ]</I>"
## [431] "</LI>"
## [432] ""
## [433] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [434] "\"Monte Carlo Markov Chain Methods for Derivative Pricing"
## [435] "and Risk Assessment,\"(with Alistair Sinclair), 2005, "
## [436] "<I>Journal of Investment Management</I>, v3(1), 29-44. "
## [437] "<a href=\"https://www.joim.com/ArticleContainer.asp?artid=125&print=false&Key=GQ6!WiJQSJrlrcVJSoeGhEQF7LVNhzfb0M!Nz!0SO5foSMK6!WiHQSJrlrcVJSoeGhEQ\">[PDF]</a>"
## [438] "<br><I>[Randomized algorithm using MCMC on very large option pricing trees"
## [439] "where incomplete information about the value of an asset may be exploited to "
## [440] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "
## [441] "approximation scheme (FPRAS) is available.]</I>"
## [442] "</LI>"
## [443] ""
## [444] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [445] "\"Correlated Default Processes: A Criterion-Based Copula Approach,\""
## [446] "(with Gary Geng), 2004, <I>Journal of Investment Management</I>, v2(2), 44-70,"
## [447] "Special Issue on Default Risk. "
## [448] "<a href=\"https://www.joim.com/ArticleContainer.asp?artid=70&print=false&Key=GQ6!WiJQSJrlrcVJSoeGhEJF7LVNhzfb0M!Nz!0SO5foSMK6!WiHQSJrlrcVJSoeGhEJ\">[PDF]</a>"
## [449] "<br><I>[Which copula and marginal distributions best describe default probability"
## [450] "correlations? Develops models and methodology to answer this question. ]</I>"
## [451] "</LI>"
## [452] ""
## [453] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [454] "\"Private Equity Returns: An Empirical Examination of the Exit of"
## [455] "Venture-Backed Companies,\" (with Murali Jagannathan and Atulya Sarin),"
## [456] "2003, <I>Journal of Investment Management</I>, v1(1), 152-177."
## [457] "<a href=\"Papers/PE_returns.pdf\">[PDF]</a>"
## [458] "<br><I>[Gains from venture-backed investments depend upon the industry, the stage of the"
## [459] "firm being financed, the valuation at the time of financing, and the prevailing market"
## [460] "sentiment. Helps understand the risk premium required for the"
## [461] "valuation of private equity investments ]</I>"
## [462] "</LI>"
## [463] ""
## [464] "<LI><img src=\"graphics/IJISAFM_cover.gif\" width=\"40\" height=\"55\"> \"A"
## [465] "Numerical Algorithm for Consumption/Investment Problems,\" (with Rangarajan"
## [466] "Sundaram), 2002, <I>International Journal of Intelligent"
## [467] "Systems in Accounting, Finance and Management</I>, (Special"
## [468] "Issue on Computational Methods in Economics and Finance), "
## [469] "December, 55-69."
## [470] "<a href=\"Papers/hjb.pdf\">[PDF]</a>"
## [471] "<br><I>[A simple regression approach to solving optimal consumption"
## [472] "and portfolio problems wit diffusions and jumps.]</I>"
## [473] "</LI>"
## [474] ""
## [475] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"
## [476] "\"Bayesian Migration in Credit Ratings Based on Probabilities of"
## [477] "Default,\" (with Rong Fan and Gary Geng), 2002, <I>Journal of"
## [478] "Fixed Income</I>, December, v12(3), 17-23. "
## [479] "<a href=\"Papers/ratingmigr.pdf\">[PDF]</a>"
## [480] "<br><I>[Bayesian model for predicting rating changes based on the"
## [481] "dynamics of default probabilities.]</I>"
## [482] "</LI>"
## [483] ""
## [484] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"
## [485] "\"The Impact of Correlated Default Risk on Credit Portfolios,\""
## [486] "(with Gifford Fong, and Gary Geng),"
## [487] "2001, <i>Journal of Fixed Income</i>, v11(3), 9-19."
## [488] "<br><I>[The connection between credit portfolio loss distributions"
## [489] "and credit correlations. ]</I>"
## [490] "</LI>"
## [491] ""
## [492] "<LI><img src=\"graphics/CIR_cover.jpg\" width=\"40\" height=\"55\">"
## [493] "\"How Diversified are Internationally Diversified Portfolios:"
## [494] "Time-Variation in the Covariances between International Returns,\""
## [495] "1998, (with Raman Uppal), <I>Canadian Investment Review</I>, Spring, 7-11."
## [496] "<a href=\"Papers/DasUppalCIR1998.pdf\">[PDF]</a>"
## [497] "<br><I>[Internation portfolio risk has systemic components. ]</I>"
## [498] "</LI> "
## [499] ""
## [500] "<LI><img src=\"graphics/REDR_cover.gif\" width=\"40\" height=\"55\">"
## [501] "\"Discrete-Time Bond and Option Pricing for Jump-Diffusion"
## [502] "Processes,\" 1997, <I>Review of Derivatives Research</I>, v1(3), 211-244. "
## [503] "<br><I>[Extends the finite-differencing approach for interest rate derivatives"
## [504] "to jump processes.]</I>"
## [505] "</LI>"
## [506] ""
## [507] "<LI><img src=\"graphics/AEL_cover.jpg\" width=\"40\" height=\"55\">"
## [508] "\"Macroeconomic Implications of Search Theory for the Labor Market,\""
## [509] "1997, <I>Applied Economics Letters</I>, December, v4, 719-723."
## [510] "<br><I>[Connects option pricing theory to labor search theory. Calibrates to "
## [511] "labor market data.]</I>"
## [512] "</LI>"
## [513] ""
## [514] "<LI> <img src=\"graphics/FMII_cover.gif\" width=\"40\" height=\"55\">"
## [515] "\"Auction Theory: A Summary with Applications and Evidence"
## [516] "from the Treasury Markets,\" 1996, (with Rangarajan Sundaram),"
## [517] "<I>Financial Markets, Institutions and Instruments</I>, v5(5), 1-36."
## [518] "<a href=\"Papers/DasSundaram_FMII1996_AuctionTheory.pdf\">[PDF]</a>"
## [519] "<br><I>[A survey of models and literature on Treasury Auctions. ]</I>"
## [520] "</LI>"
## [521] ""
## [522] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"
## [523] "\"A Simple Approach to Three Factor Affine Models of the"
## [524] "Term Structure,\" (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"
## [525] "Sundaram), 1996, <I>Journal of Fixed Income</I>, v6(3), 43-53."
## [526] "<br><I>[ An easy way to calibrate three factor models using method of moments. ]</I>"
## [527] "</LI>"
## [528] ""
## [529] "<LI> <img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"
## [530] "\"Analytical Approximations of the Term Structure"
## [531] "for Jump-diffusion Processes: A Numerical Analysis,\" 1996, "
## [532] "(with Jamil Baz), <I>Journal of Fixed Income</I>, v6(1), 78-86. "
## [533] "<br><I>[An exact solution to an approximate PDE may be better than "
## [534] "an approximate solution to an exact PDDE for term structure models. ]</I>"
## [535] "</LI>"
## [536] ""
## [537] "<LI> <img src=\"graphics/JAF_cover.jpg\" width=\"40\" height=\"55\"> \"Revisiting"
## [538] "Markov Chain Term Structure Models: Extensions and Applications,\""
## [539] "1996, <I>Financial Practice and Education</I>, v6(1), 33-45. "
## [540] "<br><I>[A new pedagogy for Markov models of interest rates. ]</I>"
## [541] "</LI>"
## [542] ""
## [543] ""
## [544] "<LI> <img src=\"graphics/REDR_cover.gif\" width=\"40\" height=\"55\">"
## [545] "\"Exact Solutions for Bond and Options Prices"
## [546] "with Systematic Jump Risk,\" 1996, (with Silverio Foresi),"
## [547] "<I>Review of Derivatives Research</I>, v1(1), 7-24. "
## [548] "<a href=\"Papers/DasForesiREDR1996.pdf\">[PDF]</a>"
## [549] "<br><I>[First paper to show that affine solutions exist for "
## [550] "jump-diffusion term structure models.]</I>"
## [551] "</LI>"
## [552] ""
## [553] "<LI> <img src=\"graphics/JOD_cover.gif\" width=\"40\" height=\"55\">"
## [554] "\"Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"
## [555] "and Credit Spreads are Stochastic,\" 1996, "
## [556] "(with Peter Tufano), <I>The Journal of Financial Engineering</I>,"
## [557] "v5(2), 161-198."
## [558] "<a href=\"Papers/DasTufanoJFE1996.pdf\">[PDF]</a>"
## [559] "<br><I>[Rating based model for credit derivatives with correlation between recovery "
## [560] "rates, interest rates and default probabilities. ]</I>"
## [561] "</LI>"
## [562] ""
## [563] "<LI> <img src=\"graphics/JOD_cover.gif\" width=\"40\" height=\"55\">"
## [564] "\"Credit Risk Derivatives,\" <I>Journal of Derivatives</I>, 1995, pg 7-21. "
## [565] "<a href=\"Papers/Das-JOD1995.pdf\">[PDF]</a>"
## [566] "<br><I>[Introduces early models for pricing credit derivatives as compound options. ]</I>"
## [567] "</LI>"
## [568] ""
## [569] ""
## [570] ""
## [571] ""
## [572] ""
## [573] "<H2>SHORTER ARTICLES and BOOK CHAPTERS (Mostly Non-refereed, Invited)</H2>"
## [574] ""
## [575] "<LI><img src=\"graphics/fame.jpg\" width=\"40\" height=\"55\">"
## [576] "\"Did CDS Trading Improve the Market for Corporate Bonds,\" (2016), "
## [577] "(with Madhu Kalimipalli and Subhankar Nayak), "
## [578] "<I>Finance and Accounting Memos</I> Issue 3, 45--49. "
## [579] "<a href=\"Papers/fame-3.pdf\">[PDF]</a>"
## [580] "<br><I>[CDS trading adversely impacted the bond market.]</I>"
## [581] "</LI> "
## [582] ""
## [583] "<LI><img src=\"graphics/FD.png\" width=\"40\" height=\"55\">"
## [584] "\"Big Data's Big Muscle,\" (2016), "
## [585] "<I>Finance and Development (IMF)</I>, September, 14(2), 26-28."
## [586] "<a href=\"Papers/FD_BigData.pdf\">[PDF]</a>"
## [587] "<br><I>[Economics in the machine age.]</I>"
## [588] "</LI> "
## [589] ""
## [590] ""
## [591] "<LI><img src=\"graphics/jwm.jpg\" width=\"40\" height=\"55\">"
## [592] "\"Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier,\" (2011), "
## [593] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "
## [594] "<I>Journal of Wealth Management</I>, Fall, 14(2), 25-31."
## [595] "<br><I>[A framework for goal driven mental accounting and behavioral portfolio allocation that extends mean-variance portfolios.]</I>"
## [596] "</LI> "
## [597] ""
## [598] "<LI><img src=\"graphics/HNAF_Wiley.jpg\" width=\"40\" height=\"55\">"
## [599] "\"News Analytics: Framework, Techniques and Metrics,\" The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [600] "<a href=\"Papers/newsmetrics.pdf\">[PDF]</a>"
## [601] "</LI>"
## [602] ""
## [603] ""
## [604] ""
## [605] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [606] "\"Random Lattices for Option Pricing Problems in Finance,\" (2011),"
## [607] "<I>Journal of Investment Management</I>, 9(2), 88-106."
## [608] "<a href=\"Papers/randlatt.pdf\">[PDF]</a>"
## [609] "</LI>"
## [610] ""
## [611] ""
## [612] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [613] "\"Implementing Option Pricing Models using Python and Cython,\" (2010),"
## [614] "(with Brian Granger), <I>Journal of Investment Management</I>, 9(4), 72-84"
## [615] "<a href=\"Papers/cython.pdf\">[PDF]</a>"
## [616] "</LI>"
## [617] ""
## [618] ""
## [619] ""
## [620] "<LI><img src=\"graphics/IEEE_IS_cover.jpg\" width=\"40\" height=\"55\">"
## [621] "\"The Finance Web: Internet Information and Markets,\" (2010), "
## [622] "<I>IEEE Intelligent Systems</I>, 25(2), Mar/Apr, 74--78. "
## [623] "</LI>"
## [624] ""
## [625] ""
## [626] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [627] "\"Financial Applications with Parallel R,\" (2009), "
## [628] "(with Brian Granger), <I>Journal of Investment Management</I>, 7(4), 66-77"
## [629] "<a href=\"Papers/parallelr_options.pdf\">[PDF]</a>"
## [630] "</LI>"
## [631] ""
## [632] ""
## [633] "<LI><img src=\"graphics/EQF.jpg\" width=\"40\" height=\"55\">"
## [634] "\"Recovery Swaps,\" (2009), (with Paul Hanouna), "
## [635] "<I>Encyclopedia of Quantitative Finance</I>, John Wiley and Sons, U.K., 1507--1509 "
## [636] ""
## [637] "<LI><img src=\"graphics/EQF.jpg\" width=\"40\" height=\"55\">"
## [638] "\"Recovery Rates,\" (2009),(with Paul Hanouna), "
## [639] "<I>Encyclopedia of Quantitative Finance</I>, John Wiley and Sons, U.K., 1505--1507"
## [640] ""
## [641] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [642] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "
## [643] "<I> Innovations in Investment Management</I>, Bloomberg Press, 85-112."
## [644] ""
## [645] ""
## [646] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [647] "\"Credit Default Swap Spreads\", 2006, (with Paul Hanouna), "
## [648] "<I>Journal of Investment Management</I>, v4(3), 93-105."
## [649] "</LI>"
## [650] ""
## [651] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [652] "\"Multiple-Core Processors for Finance Applications,\" 2006, "
## [653] "<I>Journal of Investment Management</I>, v4(2), 76-81."
## [654] "</LI>"
## [655] ""
## [656] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [657] "\"Power Laws,\" 2005, (with Jacob Sisk), "
## [658] "<I>Journal of Investment Management</I>, v3(3), 84-91."
## [659] "<a href=\"https://www.joim.com/ArticleContainer.asp?artID=154\">[PDF]</a>"
## [660] "</LI>"
## [661] ""
## [662] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [663] "\"Genetic Algorithms,\" 2005,"
## [664] "<I>Journal of Investment Management</I>, v3(2), 77-82."
## [665] "</LI>"
## [666] ""
## [667] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [668] "\"Recovery Risk,\" 2005,"
## [669] "<I>Journal of Investment Management</I>, v3(1), 113-120."
## [670] "</LI>"
## [671] ""
## [672] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [673] "\"Venture Capital Syndication\", (with Hoje Jo and Yongtae Kim), 2004"
## [674] "<I>Journal of Investment Management</I>, v2(4), 132-143."
## [675] "</LI>"
## [676] ""
## [677] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [678] "\"Technical Analysis\", (with David Tien), 2004"
## [679] "<I>Journal of Investment Management</I>, v2(1), 79-85."
## [680] "</LI>"
## [681] ""
## [682] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [683] "\"Liquidity and the Bond Markets, (with Jan Ericsson and "
## [684] "Madhu Kalimipalli), 2003,"
## [685] "<I>Journal of Investment Management</I>, v1(4), 95-103."
## [686] "</LI>"
## [687] ""
## [688] "<LI><img src=\"graphics/JEL_cover.jpg\" width=\"40\" height=\"55\">"
## [689] "\"Modern Pricing of Interest Rate Derivatives - Book Review\", "
## [690] "2004, <I>Journal of Economic Literature</I>, vXLII, 528-529."
## [691] ""
## [692] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [693] "\"Contagion\", 2003,"
## [694] "<I>Journal of Investment Management</I>, v1(3), 78-84."
## [695] "</LI>"
## [696] ""
## [697] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [698] "\"Hedge Funds\", 2003,"
## [699] "<I>Journal of Investment Management</I>, v1(2), 76-81."
## [700] "Reprinted in "
## [701] "\"Working Papers on Hedge Funds,\" in The World of Hedge Funds: "
## [702] "Characteristics and "
## [703] "Analysis, 2005, World Scientific."
## [704] "</LI>"
## [705] ""
## [706] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"
## [707] "\"The Internet and Investors\", 2003,"
## [708] "<I>Journal of Investment Management</I>, v1(1), 213-217."
## [709] "</LI>"
## [710] ""
## [711] "<LI><img src=\"graphics/EC_cover.gif\">"
## [712] " \"Useful things to know about Correlated Default Risk,\""
## [713] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"
## [714] "2001, <i>Extra Credit</i>, November-December, 14-23."
## [715] "</LI>"
## [716] ""
## [717] "<LI><img src=\"graphics/QAFM_cover.jpg\" width=\"40\" height=\"55\">"
## [718] "\"The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "
## [719] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"
## [720] "Courant Institute of Mathematical Sciences, special volume on"
## [721] "<I>Quantitative Analysis in Financial Markets</I>, Volume III, 2001."
## [722] "</LI>"
## [723] ""
## [724] "<LI><img src=\"graphics/QAFM_cover.jpg\" width=\"40\" height=\"55\">"
## [725] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "
## [726] "(with Rangarajan Sundaram), reprinted in "
## [727] "the Courant Institute of Mathematical Sciences, special volume on"
## [728] "<I>Quantitative Analysis in Financial Markets</I>, Volume III, 2001."
## [729] "</LI>"
## [730] ""
## [731] "<LI><img src=\"graphics/AFIVT_cover.jpg\" width=\"40\" height=\"55\">"
## [732] "\"Stochastic Mean Models of the Term Structure,''"
## [733] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "
## [734] "2000, <I>Advanced Fixed-Income Valuation Tools"
## [735] "</I>, edited by N. Jegadeesh and B. Tuckman,"
## [736] "John Wiley & Sons, Inc., 128-161."
## [737] "</LI>"
## [738] ""
## [739] "<LI><img src=\"graphics/AFIVT_cover.jpg\" width=\"40\" height=\"55\">"
## [740] "\"Interest Rate Modeling with Jump-Diffusion Processes,'' "
## [741] "2000, <I>Advanced Fixed-Income Valuation Tools"
## [742] "</I>, edited by N. Jegadeesh and B. Tuckman,"
## [743] "John Wiley & Sons, Inc., 162-189."
## [744] "</LI>"
## [745] ""
## [746] "<LI><img src=\"graphics/FCR_cover.jpg\" width=\"40\" height=\"55\">"
## [747] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"
## [748] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"
## [749] "in <I>The Financing of Catastrophe Risk</I>, Kenneth A"
## [750] "Froot (Ed.), University of Chicago Press, 1999, 141-145."
## [751] "</LI>"
## [752] ""
## [753] "<LI><img src=\"graphics/HCD_cover.jpg\" width=\"40\" height=\"55\">"
## [754] " \"Pricing Credit Derivatives,'' "
## [755] "1999, <I>Handbook of Credit Derivatives</I>, eds J. Francis,"
## [756] "J. Frost and J.G. Whittaker, 101-138."
## [757] "</LI>"
## [758] ""
## [759] "<LI><img src=\"graphics/PEC_cover.gif\" width=\"40\" height=\"55\">"
## [760] "\"On the Recursive Implementation of Term Structure Models,'' "
## [761] "1998, <I>Pecunia</I>, The Netherlands, Summer 1998, 45-49."
## [762] "</LI>"
## [763] ""
## [764] ""
## [765] "</OL>"
## [766] ""
## [767] ""
## [768] "<H2>WORKING PAPERS</H2>"
## [769] ""
## [770] "<OL>"
## [771] ""
## [772] "<LI><img src=\"graphics/frog2.gif\">"
## [773] "”Local Volatility and the Recovery Rate of Credit Default Swaps”, "
## [774] "(with Jeroen Jansen and Frank Fabozzi)."
## [775] "<a href=\"Papers/LocalVolatility.pdf\">[PDF]</a>. "
## [776] ""
## [777] "<LI><img src=\"graphics/frog2.gif\">"
## [778] "\"Efficient Rebalancing of Taxable Portfolios\" (with Dan Ostrov, Dennis Ding, Vincent Newell), "
## [779] "<a href=\"Papers/taxopt.pdf\">[PDF]</a>. "
## [780] "<a href=\"Papers/taxopt_slides_RFinance_2015_05.pdf\">[SLIDES RFinance]</a>. "
## [781] "<a href=\"Papers/taxopt_slides2.pdf\">[SLIDES JOIM]</a>. "
## [782] ""
## [783] ""
## [784] "<LI><img src=\"graphics/frog2.gif\">"
## [785] "\"The Fast and the Curious: VC Drift\" "
## [786] "(with Amit Bubna and Paul Hanouna), "
## [787] "<a href=\"Papers/vcstyle.pdf\">[PDF]</a>"
## [788] ""
## [789] ""
## [790] "<LI><img src=\"graphics/frog2.gif\">"
## [791] "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), "
## [792] "<a href=\"Papers/vccomm.pdf\">[PDF]</a>"
## [793] ""
## [794] ""
## [795] ""
## [796] ""
## [797] ""
## [798] "</OL>"
## [799] ""
## [800] ""
## [801] ""
## [802] ""
## [803] ""
## [804] ""
## [805] ""
## [806] ""
## [807] "</UL>"
## [808] "<p>"
## [809] "My page on SSRN (with downloadable papers) is <a"
## [810] "href=\"http://ssrn.com/author=17108\">here</a>."
## [811] ""
## [812] ""
## [813] ""
## [814] " "
## [815] ""
## [816] ""
## [817] ""
## [818] "</BODY>"
## [819] ""
## [820] "</HTML>"
## [821] ""
## [822] ""
## [823] ""
#Drop lines containing HTML markup and other special characters
text = text[!grepl("<", text)]
text = text[!grepl(">", text)]
text = text[!grepl("]", text)]
text = text[!grepl("}", text)]
text = text[!grepl("_", text)]
text = text[!grepl("/", text)]
print(length(text))## [1] 336
print(text)## [1] ""
## [2] ""
## [3] ""
## [4] "\"Data Science: Theories, Models, Algorithms, and Analytics\" (web book -- work in progress)"
## [5] ""
## [6] ""
## [7] "\"Derivatives: Principles and Practice\" (2010),"
## [8] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."
## [9] ""
## [10] ""
## [11] ""
## [12] ""
## [13] "\"An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."
## [14] ""
## [15] "\"Matrix Metrics: Network-Based Systemic Risk Scoring\", (2016)."
## [16] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "
## [17] "the best paper on SIFIs (systemically important financial institutions). "
## [18] "It also won the best paper award at "
## [19] ""
## [20] ""
## [21] ""
## [22] ""
## [23] "\"Credit Spreads with Dynamic Debt\" (with Seoyoung Kim), (2015), "
## [24] ""
## [25] "\"Text and Context: Language Analytics for Finance\", (2014),"
## [26] ""
## [27] ""
## [28] ""
## [29] "\"Strategic Loan Modification: An Options-Based Response to Strategic Default,\""
## [30] ""
## [31] ""
## [32] "\"Options and Structured Products in Behavioral Portfolios,\" (with Meir Statman), (2013), "
## [33] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."
## [34] ""
## [35] ""
## [36] ""
## [37] "\"Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance,\" (2011), (with Hoje Jo and Yongtae Kim), "
## [38] ""
## [39] "Optimization with Mental Accounts,\" (2010), (with Harry Markowitz, Jonathan"
## [40] ""
## [41] ""
## [42] ""
## [43] ""
## [44] ""
## [45] ""
## [46] ""
## [47] ""
## [48] "\"Accounting-based versus market-based cross-sectional models of CDS spreads,\" "
## [49] "(with Paul Hanouna and Atulya Sarin), (2009), "
## [50] ""
## [51] ""
## [52] "\"Hedging Credit: Equity Liquidity Matters,\" (with Paul Hanouna), (2009),"
## [53] ""
## [54] "\"An Integrated Model for Hybrid Securities,\""
## [55] ""
## [56] "\"Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,\""
## [57] ""
## [58] "\"Common Failings: How Corporate Defaults are Correlated\" "
## [59] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."
## [60] ""
## [61] "\"A Clinical Study of Investor Discussion and Sentiment,\" "
## [62] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "
## [63] ""
## [64] "\"International Portfolio Choice with Systemic Risk,\""
## [65] "The loss resulting from diminished diversification is small, while"
## [66] ""
## [67] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"
## [68] "investor welfare. Contrary to regulatory intuition, incentive structures"
## [69] ""
## [70] "\"A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"
## [71] "with Rating Transitions,\" (with Viral Acharya and Rangarajan Sundaram),"
## [72] ""
## [73] ""
## [74] "\"Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"
## [75] ""
## [76] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "
## [77] ""
## [78] "\"The Psychology of Financial Decision Making: A Case"
## [79] "for Theory-Driven Experimental Enquiry,''"
## [80] "1999, (with Priya Raghubir),"
## [81] ""
## [82] "\"Of Smiles and Smirks: A Term Structure Perspective,''"
## [83] ""
## [84] "\"A Theory of Banking Structure,\" 1999, (with Ashish Nanda),"
## [85] "by function based upon two dimensions: the degree of information asymmetry "
## [86] ""
## [87] "\"A Theory of Optimal Timing and Selectivity,'' "
## [88] ""
## [89] "\"A Direct Discrete-Time Approach to"
## [90] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "
## [91] ""
## [92] "\"The Central Tendency: A Second Factor in"
## [93] "Bond Yields,\" 1998, (with Silverio Foresi and Pierluigi Balduzzi), "
## [94] ""
## [95] "\"Efficiency with Costly Information: A Reinterpretation of"
## [96] "Evidence from Managed Portfolios,\" (with Edwin Elton, Martin Gruber and Matt "
## [97] "Presented and Reprinted in the Proceedings of The "
## [98] "Seminar on the Analysis of Security Prices at the Center "
## [99] "for Research in Security Prices at the University of "
## [100] ""
## [101] ""
## [102] ""
## [103] ""
## [104] ""
## [105] ""
## [106] ""
## [107] "\"Managing Rollover Risk with Capital Structure Covenants"
## [108] "in Structured Finance Vehicles\" (2016),"
## [109] ""
## [110] ""
## [111] ""
## [112] "\"The Design and Risk Management of Structured Finance Vehicles\" (2016),"
## [113] "Post the recent subprime financial crisis, we inform the creation of safer SIVs "
## [114] "in structured finance, and propose avenues of mitigating risks faced by senior debt through "
## [115] ""
## [116] ""
## [117] "\"Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework\" (2014), "
## [118] "(with Seoyoung Kim and Meir Statman), "
## [119] ""
## [120] ""
## [121] "\"Going for Broke: Restructuring Distressed Debt Portfolios\" (2014),"
## [122] ""
## [123] ""
## [124] "\"Digital Portfolios.\" (2013), "
## [125] ""
## [126] ""
## [127] "\"Options on Portfolios with Higher-Order Moments,\" (2009),"
## [128] "options on a multivariate system of assets, calibrated to the return "
## [129] ""
## [130] ""
## [131] "\"Dealing with Dimension: Option Pricing on Factor Trees,\" (2009),"
## [132] "you to price options on multiple assets in a unified fraamework. Computational"
## [133] ""
## [134] ""
## [135] "\"Modeling"
## [136] "Correlated Default with a Forest of Binomial Trees,\" (2007), (with"
## [137] ""
## [138] "\"Basel II: Correlation Related Issues\" (2007), "
## [139] ""
## [140] "\"Correlated Default Risk,\" (2006),"
## [141] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"
## [142] "increase as markets worsen. Regime switching models are needed to explain dynamic"
## [143] ""
## [144] "\"A Simple Model for Pricing Equity Options with Markov"
## [145] "Switching State Variables\" (2006),"
## [146] "(with Donald Aingworth and Rajeev Motwani),"
## [147] ""
## [148] "\"The Firm's Management of Social Interactions,\" (2005)"
## [149] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "
## [150] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "
## [151] ""
## [152] "\"Financial Communities\" (with Jacob Sisk), 2005, "
## [153] "Summer, 112-123."
## [154] ""
## [155] "\"Monte Carlo Markov Chain Methods for Derivative Pricing"
## [156] "and Risk Assessment,\"(with Alistair Sinclair), 2005, "
## [157] "where incomplete information about the value of an asset may be exploited to "
## [158] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "
## [159] ""
## [160] "\"Correlated Default Processes: A Criterion-Based Copula Approach,\""
## [161] "Special Issue on Default Risk. "
## [162] ""
## [163] "\"Private Equity Returns: An Empirical Examination of the Exit of"
## [164] "Venture-Backed Companies,\" (with Murali Jagannathan and Atulya Sarin),"
## [165] "firm being financed, the valuation at the time of financing, and the prevailing market"
## [166] "sentiment. Helps understand the risk premium required for the"
## [167] ""
## [168] "Issue on Computational Methods in Economics and Finance), "
## [169] "December, 55-69."
## [170] ""
## [171] "\"Bayesian Migration in Credit Ratings Based on Probabilities of"
## [172] ""
## [173] "\"The Impact of Correlated Default Risk on Credit Portfolios,\""
## [174] "(with Gifford Fong, and Gary Geng),"
## [175] ""
## [176] "\"How Diversified are Internationally Diversified Portfolios:"
## [177] "Time-Variation in the Covariances between International Returns,\""
## [178] ""
## [179] "\"Discrete-Time Bond and Option Pricing for Jump-Diffusion"
## [180] ""
## [181] "\"Macroeconomic Implications of Search Theory for the Labor Market,\""
## [182] ""
## [183] "\"Auction Theory: A Summary with Applications and Evidence"
## [184] "from the Treasury Markets,\" 1996, (with Rangarajan Sundaram),"
## [185] ""
## [186] "\"A Simple Approach to Three Factor Affine Models of the"
## [187] "Term Structure,\" (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"
## [188] ""
## [189] "\"Analytical Approximations of the Term Structure"
## [190] "for Jump-diffusion Processes: A Numerical Analysis,\" 1996, "
## [191] ""
## [192] "Markov Chain Term Structure Models: Extensions and Applications,\""
## [193] ""
## [194] ""
## [195] "\"Exact Solutions for Bond and Options Prices"
## [196] "with Systematic Jump Risk,\" 1996, (with Silverio Foresi),"
## [197] ""
## [198] "\"Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"
## [199] "and Credit Spreads are Stochastic,\" 1996, "
## [200] "v5(2), 161-198."
## [201] ""
## [202] ""
## [203] ""
## [204] ""
## [205] ""
## [206] ""
## [207] ""
## [208] "\"Did CDS Trading Improve the Market for Corporate Bonds,\" (2016), "
## [209] "(with Madhu Kalimipalli and Subhankar Nayak), "
## [210] ""
## [211] "\"Big Data's Big Muscle,\" (2016), "
## [212] ""
## [213] ""
## [214] "\"Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier,\" (2011), "
## [215] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "
## [216] ""
## [217] "\"News Analytics: Framework, Techniques and Metrics,\" The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [218] ""
## [219] ""
## [220] ""
## [221] "\"Random Lattices for Option Pricing Problems in Finance,\" (2011),"
## [222] ""
## [223] ""
## [224] "\"Implementing Option Pricing Models using Python and Cython,\" (2010),"
## [225] ""
## [226] ""
## [227] ""
## [228] "\"The Finance Web: Internet Information and Markets,\" (2010), "
## [229] ""
## [230] ""
## [231] "\"Financial Applications with Parallel R,\" (2009), "
## [232] ""
## [233] ""
## [234] "\"Recovery Swaps,\" (2009), (with Paul Hanouna), "
## [235] ""
## [236] "\"Recovery Rates,\" (2009),(with Paul Hanouna), "
## [237] ""
## [238] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "
## [239] ""
## [240] ""
## [241] "\"Credit Default Swap Spreads\", 2006, (with Paul Hanouna), "
## [242] ""
## [243] "\"Multiple-Core Processors for Finance Applications,\" 2006, "
## [244] ""
## [245] "\"Power Laws,\" 2005, (with Jacob Sisk), "
## [246] ""
## [247] "\"Genetic Algorithms,\" 2005,"
## [248] ""
## [249] "\"Recovery Risk,\" 2005,"
## [250] ""
## [251] "\"Venture Capital Syndication\", (with Hoje Jo and Yongtae Kim), 2004"
## [252] ""
## [253] "\"Technical Analysis\", (with David Tien), 2004"
## [254] ""
## [255] "\"Liquidity and the Bond Markets, (with Jan Ericsson and "
## [256] "Madhu Kalimipalli), 2003,"
## [257] ""
## [258] "\"Modern Pricing of Interest Rate Derivatives - Book Review\", "
## [259] ""
## [260] "\"Contagion\", 2003,"
## [261] ""
## [262] "\"Hedge Funds\", 2003,"
## [263] "Reprinted in "
## [264] "\"Working Papers on Hedge Funds,\" in The World of Hedge Funds: "
## [265] "Characteristics and "
## [266] "Analysis, 2005, World Scientific."
## [267] ""
## [268] "\"The Internet and Investors\", 2003,"
## [269] ""
## [270] " \"Useful things to know about Correlated Default Risk,\""
## [271] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"
## [272] ""
## [273] "\"The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "
## [274] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"
## [275] "Courant Institute of Mathematical Sciences, special volume on"
## [276] ""
## [277] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "
## [278] "(with Rangarajan Sundaram), reprinted in "
## [279] "the Courant Institute of Mathematical Sciences, special volume on"
## [280] ""
## [281] "\"Stochastic Mean Models of the Term Structure,''"
## [282] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "
## [283] "John Wiley & Sons, Inc., 128-161."
## [284] ""
## [285] "\"Interest Rate Modeling with Jump-Diffusion Processes,'' "
## [286] "John Wiley & Sons, Inc., 162-189."
## [287] ""
## [288] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"
## [289] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"
## [290] "Froot (Ed.), University of Chicago Press, 1999, 141-145."
## [291] ""
## [292] " \"Pricing Credit Derivatives,'' "
## [293] "J. Frost and J.G. Whittaker, 101-138."
## [294] ""
## [295] "\"On the Recursive Implementation of Term Structure Models,'' "
## [296] ""
## [297] ""
## [298] ""
## [299] ""
## [300] ""
## [301] ""
## [302] "”Local Volatility and the Recovery Rate of Credit Default Swaps”, "
## [303] "(with Jeroen Jansen and Frank Fabozzi)."
## [304] ""
## [305] "\"Efficient Rebalancing of Taxable Portfolios\" (with Dan Ostrov, Dennis Ding, Vincent Newell), "
## [306] ""
## [307] ""
## [308] "\"The Fast and the Curious: VC Drift\" "
## [309] "(with Amit Bubna and Paul Hanouna), "
## [310] ""
## [311] ""
## [312] "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), "
## [313] ""
## [314] ""
## [315] ""
## [316] ""
## [317] ""
## [318] ""
## [319] ""
## [320] ""
## [321] ""
## [322] ""
## [323] ""
## [324] ""
## [325] ""
## [326] ""
## [327] ""
## [328] ""
## [329] " "
## [330] ""
## [331] ""
## [332] ""
## [333] ""
## [334] ""
## [335] ""
## [336] ""
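The line-filtering step above, which drops any line containing one of the markup characters, can also be expressed as a single pass with one character class. A minimal base-R sketch on a toy vector (the sample lines are hypothetical, not from the bio page):

```r
# Toy input standing in for scraped web-page lines (hypothetical examples)
txt_demo = c("<HTML>", "Plain prose line", "a_link/path", "Another clean line")

# One grepl() with a character class drops lines containing < > ] } _ or /
clean = txt_demo[!grepl("[<>\\]}_/]", txt_demo)]
print(clean)
```

This keeps only the two prose lines and discards the markup-bearing ones in a single vectorized call.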
library(stringr)
text = str_replace_all(text,"[\"]","")
idx = which(nchar(text)==0)
research = text[setdiff(seq(1,length(text)),idx)]
print(research)## [1] "Data Science: Theories, Models, Algorithms, and Analytics (web book -- work in progress)"
## [2] "Derivatives: Principles and Practice (2010),"
## [3] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."
## [4] "An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."
## [5] "Matrix Metrics: Network-Based Systemic Risk Scoring, (2016)."
## [6] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "
## [7] "the best paper on SIFIs (systemically important financial institutions). "
## [8] "It also won the best paper award at "
## [9] "Credit Spreads with Dynamic Debt (with Seoyoung Kim), (2015), "
## [10] "Text and Context: Language Analytics for Finance, (2014),"
## [11] "Strategic Loan Modification: An Options-Based Response to Strategic Default,"
## [12] "Options and Structured Products in Behavioral Portfolios, (with Meir Statman), (2013), "
## [13] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."
## [14] "Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance, (2011), (with Hoje Jo and Yongtae Kim), "
## [15] "Optimization with Mental Accounts, (2010), (with Harry Markowitz, Jonathan"
## [16] "Accounting-based versus market-based cross-sectional models of CDS spreads, "
## [17] "(with Paul Hanouna and Atulya Sarin), (2009), "
## [18] "Hedging Credit: Equity Liquidity Matters, (with Paul Hanouna), (2009),"
## [19] "An Integrated Model for Hybrid Securities,"
## [20] "Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,"
## [21] "Common Failings: How Corporate Defaults are Correlated "
## [22] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."
## [23] "A Clinical Study of Investor Discussion and Sentiment, "
## [24] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "
## [25] "International Portfolio Choice with Systemic Risk,"
## [26] "The loss resulting from diminished diversification is small, while"
## [27] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"
## [28] "investor welfare. Contrary to regulatory intuition, incentive structures"
## [29] "A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"
## [30] "with Rating Transitions, (with Viral Acharya and Rangarajan Sundaram),"
## [31] "Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"
## [32] "A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "
## [33] "The Psychology of Financial Decision Making: A Case"
## [34] "for Theory-Driven Experimental Enquiry,''"
## [35] "1999, (with Priya Raghubir),"
## [36] "Of Smiles and Smirks: A Term Structure Perspective,''"
## [37] "A Theory of Banking Structure, 1999, (with Ashish Nanda),"
## [38] "by function based upon two dimensions: the degree of information asymmetry "
## [39] "A Theory of Optimal Timing and Selectivity,'' "
## [40] "A Direct Discrete-Time Approach to"
## [41] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "
## [42] "The Central Tendency: A Second Factor in"
## [43] "Bond Yields, 1998, (with Silverio Foresi and Pierluigi Balduzzi), "
## [44] "Efficiency with Costly Information: A Reinterpretation of"
## [45] "Evidence from Managed Portfolios, (with Edwin Elton, Martin Gruber and Matt "
## [46] "Presented and Reprinted in the Proceedings of The "
## [47] "Seminar on the Analysis of Security Prices at the Center "
## [48] "for Research in Security Prices at the University of "
## [49] "Managing Rollover Risk with Capital Structure Covenants"
## [50] "in Structured Finance Vehicles (2016),"
## [51] "The Design and Risk Management of Structured Finance Vehicles (2016),"
## [52] "Post the recent subprime financial crisis, we inform the creation of safer SIVs "
## [53] "in structured finance, and propose avenues of mitigating risks faced by senior debt through "
## [54] "Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework (2014), "
## [55] "(with Seoyoung Kim and Meir Statman), "
## [56] "Going for Broke: Restructuring Distressed Debt Portfolios (2014),"
## [57] "Digital Portfolios. (2013), "
## [58] "Options on Portfolios with Higher-Order Moments, (2009),"
## [59] "options on a multivariate system of assets, calibrated to the return "
## [60] "Dealing with Dimension: Option Pricing on Factor Trees, (2009),"
## [61] "you to price options on multiple assets in a unified fraamework. Computational"
## [62] "Modeling"
## [63] "Correlated Default with a Forest of Binomial Trees, (2007), (with"
## [64] "Basel II: Correlation Related Issues (2007), "
## [65] "Correlated Default Risk, (2006),"
## [66] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"
## [67] "increase as markets worsen. Regime switching models are needed to explain dynamic"
## [68] "A Simple Model for Pricing Equity Options with Markov"
## [69] "Switching State Variables (2006),"
## [70] "(with Donald Aingworth and Rajeev Motwani),"
## [71] "The Firm's Management of Social Interactions, (2005)"
## [72] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "
## [73] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "
## [74] "Financial Communities (with Jacob Sisk), 2005, "
## [75] "Summer, 112-123."
## [76] "Monte Carlo Markov Chain Methods for Derivative Pricing"
## [77] "and Risk Assessment,(with Alistair Sinclair), 2005, "
## [78] "where incomplete information about the value of an asset may be exploited to "
## [79] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "
## [80] "Correlated Default Processes: A Criterion-Based Copula Approach,"
## [81] "Special Issue on Default Risk. "
## [82] "Private Equity Returns: An Empirical Examination of the Exit of"
## [83] "Venture-Backed Companies, (with Murali Jagannathan and Atulya Sarin),"
## [84] "firm being financed, the valuation at the time of financing, and the prevailing market"
## [85] "sentiment. Helps understand the risk premium required for the"
## [86] "Issue on Computational Methods in Economics and Finance), "
## [87] "December, 55-69."
## [88] "Bayesian Migration in Credit Ratings Based on Probabilities of"
## [89] "The Impact of Correlated Default Risk on Credit Portfolios,"
## [90] "(with Gifford Fong, and Gary Geng),"
## [91] "How Diversified are Internationally Diversified Portfolios:"
## [92] "Time-Variation in the Covariances between International Returns,"
## [93] "Discrete-Time Bond and Option Pricing for Jump-Diffusion"
## [94] "Macroeconomic Implications of Search Theory for the Labor Market,"
## [95] "Auction Theory: A Summary with Applications and Evidence"
## [96] "from the Treasury Markets, 1996, (with Rangarajan Sundaram),"
## [97] "A Simple Approach to Three Factor Affine Models of the"
## [98] "Term Structure, (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"
## [99] "Analytical Approximations of the Term Structure"
## [100] "for Jump-diffusion Processes: A Numerical Analysis, 1996, "
## [101] "Markov Chain Term Structure Models: Extensions and Applications,"
## [102] "Exact Solutions for Bond and Options Prices"
## [103] "with Systematic Jump Risk, 1996, (with Silverio Foresi),"
## [104] "Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"
## [105] "and Credit Spreads are Stochastic, 1996, "
## [106] "v5(2), 161-198."
## [107] "Did CDS Trading Improve the Market for Corporate Bonds, (2016), "
## [108] "(with Madhu Kalimipalli and Subhankar Nayak), "
## [109] "Big Data's Big Muscle, (2016), "
## [110] "Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier, (2011), "
## [111] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "
## [112] "News Analytics: Framework, Techniques and Metrics, The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [113] "Random Lattices for Option Pricing Problems in Finance, (2011),"
## [114] "Implementing Option Pricing Models using Python and Cython, (2010),"
## [115] "The Finance Web: Internet Information and Markets, (2010), "
## [116] "Financial Applications with Parallel R, (2009), "
## [117] "Recovery Swaps, (2009), (with Paul Hanouna), "
## [118] "Recovery Rates, (2009),(with Paul Hanouna), "
## [119] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "
## [120] "Credit Default Swap Spreads, 2006, (with Paul Hanouna), "
## [121] "Multiple-Core Processors for Finance Applications, 2006, "
## [122] "Power Laws, 2005, (with Jacob Sisk), "
## [123] "Genetic Algorithms, 2005,"
## [124] "Recovery Risk, 2005,"
## [125] "Venture Capital Syndication, (with Hoje Jo and Yongtae Kim), 2004"
## [126] "Technical Analysis, (with David Tien), 2004"
## [127] "Liquidity and the Bond Markets, (with Jan Ericsson and "
## [128] "Madhu Kalimipalli), 2003,"
## [129] "Modern Pricing of Interest Rate Derivatives - Book Review, "
## [130] "Contagion, 2003,"
## [131] "Hedge Funds, 2003,"
## [132] "Reprinted in "
## [133] "Working Papers on Hedge Funds, in The World of Hedge Funds: "
## [134] "Characteristics and "
## [135] "Analysis, 2005, World Scientific."
## [136] "The Internet and Investors, 2003,"
## [137] " Useful things to know about Correlated Default Risk,"
## [138] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"
## [139] "The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "
## [140] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"
## [141] "Courant Institute of Mathematical Sciences, special volume on"
## [142] "A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "
## [143] "(with Rangarajan Sundaram), reprinted in "
## [144] "the Courant Institute of Mathematical Sciences, special volume on"
## [145] "Stochastic Mean Models of the Term Structure,''"
## [146] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "
## [147] "John Wiley & Sons, Inc., 128-161."
## [148] "Interest Rate Modeling with Jump-Diffusion Processes,'' "
## [149] "John Wiley & Sons, Inc., 162-189."
## [150] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"
## [151] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"
## [152] "Froot (Ed.), University of Chicago Press, 1999, 141-145."
## [153] " Pricing Credit Derivatives,'' "
## [154] "J. Frost and J.G. Whittaker, 101-138."
## [155] "On the Recursive Implementation of Term Structure Models,'' "
## [156] "”Local Volatility and the Recovery Rate of Credit Default Swaps”, "
## [157] "(with Jeroen Jansen and Frank Fabozzi)."
## [158] "Efficient Rebalancing of Taxable Portfolios (with Dan Ostrov, Dennis Ding, Vincent Newell), "
## [159] "The Fast and the Curious: VC Drift "
## [160] "(with Amit Bubna and Paul Hanouna), "
## [161] "Venture Capital Communities (with Amit Bubna and Nagpurnanand Prabhala), "
## [162] " "
Take a look at the text now to see how much cleaner it is. But there is a better way: use the text-mining package tm.
R's text-mining package is succinctly named tm. With reader functions such as readDOC() and readPDF() for DOC and PDF files, the package makes it easy to ingest many file formats.
Text mining involves applying functions to many text documents. A collection of text documents (irrespective of format) is called a corpus. The essential and highly useful feature of text-mining packages is the ability to operate on the entire set of documents in one go.
library(tm)## Loading required package: NLP
text = c("INTL is expected to announce good earnings report", "AAPL first quarter disappoints","GOOG announces new wallet", "YHOO ascends from old ways")
text_corpus = Corpus(VectorSource(text))
print(text_corpus)## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
writeCorpus(text_corpus)The writeCorpus() function in tm writes each document to a separate text file on disk, named 1.txt, 2.txt, etc. by default. The code above shows how text scraped off a web page and collapsed into a single character string per document may then be converted into a corpus of documents using the Corpus() function.
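For comparison, the same per-document files can be produced without tm using base R's writeLines(); a minimal sketch that writes numbered files to a temporary directory (the two sample sentences reuse the corpus text above):

```r
docs = c("INTL is expected to announce good earnings report",
         "AAPL first quarter disappoints")

# Mimic writeCorpus(): write one numbered .txt file per document
outdir = tempdir()
for (j in seq_along(docs)) {
  writeLines(docs[j], file.path(outdir, paste0(j, ".txt")))
}
print(readLines(file.path(outdir, "1.txt")))
```

Reading 1.txt back returns the first document unchanged, confirming the round trip.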
It is easy to inspect the corpus as follows:
inspect(text_corpus)## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 49
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 30
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 25
##
## [[4]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 26
We can also use lapply with as.character to view the contents of individual documents in the corpus, as the next example shows.
#USING THE tm PACKAGE
library(tm)
text = c("Doc1;","This is doc2 --", "And, then Doc3.")
ctext = Corpus(VectorSource(text))
ctext## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
#writeCorpus(ctext)
#THE CORPUS IS A LIST OBJECT in R of type VCorpus or Corpus
inspect(ctext)## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 5
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15
print(as.character(ctext[[1]]))## [1] "Doc1;"
print(lapply(ctext[1:2],as.character))## $`1`
## [1] "Doc1;"
##
## $`2`
## [1] "This is doc2 --"
ctext = tm_map(ctext,tolower) #Lower case all text in all docs
inspect(ctext)## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## [1] doc1;
##
## [[2]]
## [1] this is doc2 --
##
## [[3]]
## [1] and, then doc3.
ctext2 = tm_map(ctext,toupper)
inspect(ctext2)## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## [1] DOC1;
##
## [[2]]
## [1] THIS IS DOC2 --
##
## [[3]]
## [1] AND, THEN DOC3.
#FIRST CURATE TO UPPER CASE
dropWords = c("IS","AND","THEN")
ctext2 = tm_map(ctext2,removeWords,dropWords)
inspect(ctext2)## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## [1] DOC1;
##
## [[2]]
## [1] THIS DOC2 --
##
## [[3]]
## [1] , DOC3.
ctext = Corpus(VectorSource(text))
temp = ctext
print(lapply(temp,as.character))## $`1`
## [1] "Doc1;"
##
## $`2`
## [1] "This is doc2 --"
##
## $`3`
## [1] "And, then Doc3."
temp = tm_map(temp,removeWords,stopwords("english"))
print(lapply(temp,as.character))## $`1`
## [1] "Doc1;"
##
## $`2`
## [1] "This doc2 --"
##
## $`3`
## [1] "And, Doc3."
temp = tm_map(temp,removePunctuation)
print(lapply(temp,as.character))## $`1`
## [1] "Doc1"
##
## $`2`
## [1] "This doc2 "
##
## $`3`
## [1] "And Doc3"
temp = tm_map(temp,removeNumbers)
print(lapply(temp,as.character))## $`1`
## [1] "Doc"
##
## $`2`
## [1] "This doc "
##
## $`3`
## [1] "And Doc"
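The tm_map() cleanup chain above (stopword removal, then punctuation, then numbers) can be mirrored in base R with gsub(), which makes a handy sanity check. A minimal sketch; note that the two-word stopword vector is an illustrative subset, not tm's full stopwords("english") list:

```r
docs = c("Doc1;", "This is doc2 --", "And, then Doc3.")
stops = c("is", "then")  # illustrative subset of English stopwords

# Remove whole-word stopwords, then punctuation, then digits
docs = gsub(paste0("\\b(", paste(stops, collapse = "|"), ")\\b"), "", docs)
docs = gsub("[[:punct:]]", "", docs)
docs = gsub("[0-9]+", "", docs)
print(docs)
```

As with tm, the removed tokens leave extra whitespace behind, which the bag-of-words collapse below absorbs.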
We can create a bag of words by collapsing all the text into a single string.
#CONVERT CORPUS INTO ARRAY OF STRINGS AND FLATTEN
txt = NULL
for (j in 1:length(temp)) {
txt = c(txt,temp[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)## [1] "doc this doc and doc"
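From this flattened string, term frequencies follow directly with strsplit() and table(); a minimal sketch:

```r
bag = "doc this doc and doc"  # the flattened, lower-cased string from above

# Split on whitespace and tabulate word counts
words = unlist(strsplit(bag, "[[:space:]]+"))
freq = sort(table(words), decreasing = TRUE)
print(freq)
```

This is the simplest form of a term-frequency vector, the building block for the document-term matrices used later in text mining.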
Now we apply the full sequence of steps to my bio page.
text = readLines("http://srdas.github.io/bio-candid.html")
ctext = Corpus(VectorSource(text))
ctext## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 79
print(lapply(ctext, as.character))## $`1`
## [1] "<HTML>"
##
## $`2`
## [1] "<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">"
##
## $`3`
## [1] ""
##
## $`4`
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
##
## $`5`
## [1] "Santa Clara University's Leavey School of Business. He previously held"
##
## $`6`
## [1] "faculty appointments as Associate Professor at Harvard Business School"
##
## $`7`
## [1] "and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and"
##
## $`8`
## [1] "Ph.D. from New York University), Computer Science (M.S. from UC"
##
## $`9`
## [1] "Berkeley), an MBA from the Indian Institute of Management, Ahmedabad,"
##
## $`10`
## [1] "B.Com in Accounting and Economics (University of Bombay, Sydenham"
##
## $`11`
## [1] "College), and is also a qualified Cost and Works Accountant. He is a"
##
## $`12`
## [1] "senior editor of The Journal of Investment Management, co-editor of"
##
## $`13`
## [1] "The Journal of Derivatives and The Journal of Financial Services"
##
## $`14`
## [1] "Research, and Associate Editor of other academic journals. Prior to"
##
## $`15`
## [1] "being an academic, he worked in the derivatives business in the"
##
## $`16`
## [1] "Asia-Pacific region as a Vice-President at Citibank. His current"
##
## $`17`
## [1] "research interests include: the modeling of default risk, machine"
##
## $`18`
## [1] "learning, social networks, derivatives pricing models, portfolio"
##
## $`19`
## [1] "theory, and venture capital. He has published over ninety articles in"
##
## $`20`
## [1] "academic journals, and has won numerous awards for research and"
##
## $`21`
## [1] "teaching. His recent book \"Derivatives: Principles and Practice\" was"
##
## $`22`
## [1] "published in May 2010. He currently also serves as a Senior Fellow at"
##
## $`23`
## [1] "the FDIC Center for Financial Research."
##
## $`24`
## [1] ""
##
## $`25`
## [1] ""
##
## $`26`
## [1] "<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>"
##
## $`27`
## [1] ""
##
## $`28`
## [1] "After loafing and working in many parts of Asia, but never really"
##
## $`29`
## [1] "growing up, Sanjiv moved to New York to change the world, hopefully"
##
## $`30`
## [1] "through research. He graduated in 1994 with a Ph.D. from NYU, and"
##
## $`31`
## [1] "since then spent five years in Boston, and now lives in San Jose,"
##
## $`32`
## [1] "California. Sanjiv loves animals, places in the world where the"
##
## $`33`
## [1] "mountains meet the sea, riding sport motorbikes, reading, gadgets,"
##
## $`34`
## [1] "science fiction movies, and writing cool software code. When there is"
##
## $`35`
## [1] "time available from the excitement of daily life, Sanjiv writes"
##
## $`36`
## [1] "academic papers, which helps him relax. Always the contrarian, Sanjiv"
##
## $`37`
## [1] "thinks that New York City is the most calming place in the world,"
##
## $`38`
## [1] "after California of course."
##
## $`39`
## [1] ""
##
## $`40`
## [1] "<p>"
##
## $`41`
## [1] ""
##
## $`42`
## [1] "Sanjiv is now a Professor of Finance at Santa Clara University. He came"
##
## $`43`
## [1] "to SCU from Harvard Business School and spent a year at UC Berkeley. In"
##
## $`44`
## [1] "his past life in the unreal world, Sanjiv worked at Citibank, N.A. in"
##
## $`45`
## [1] "the Asia-Pacific region. He takes great pleasure in merging his many"
##
## $`46`
## [1] "previous lives into his current existence, which is incredibly confused"
##
## $`47`
## [1] "and diverse."
##
## $`48`
## [1] ""
##
## $`49`
## [1] "<p>"
##
## $`50`
## [1] ""
##
## $`51`
## [1] "Sanjiv's research style is instilled with a distinct \"New York state of"
##
## $`52`
## [1] "mind\" - it is chaotic, diverse, with minimal method to the madness. He"
##
## $`53`
## [1] "has published articles on derivatives, term-structure models, mutual"
##
## $`54`
## [1] "funds, the internet, portfolio choice, banking models, credit risk, and"
##
## $`55`
## [1] "has unpublished articles in many other areas. Some years ago, he took"
##
## $`56`
## [1] "time off to get another degree in computer science at Berkeley,"
##
## $`57`
## [1] "confirming that an unchecked hobby can quickly become an obsession."
##
## $`58`
## [1] "There he learnt about the fascinating field of Randomized Algorithms,"
##
## $`59`
## [1] "skills he now applies earnestly to his editorial work, and other"
##
## $`60`
## [1] "pursuits, many of which stem from being in the epicenter of Silicon"
##
## $`61`
## [1] "Valley."
##
## $`62`
## [1] ""
##
## $`63`
## [1] "<p>"
##
## $`64`
## [1] ""
##
## $`65`
## [1] "Coastal living did a lot to mold Sanjiv, who needs to live near the"
##
## $`66`
## [1] "ocean. The many walks in Greenwich village convinced him that there is"
##
## $`67`
## [1] "no such thing as a representative investor, yet added many unique"
##
## $`68`
## [1] "features to his personal utility function. He learnt that it is"
##
## $`69`
## [1] "important to open the academic door to the ivory tower and let the world"
##
## $`70`
## [1] "in. Academia is a real challenge, given that he has to reconcile many"
##
## $`71`
## [1] "more opinions than ideas. He has been known to have turned down many"
##
## $`72`
## [1] "offers from Mad magazine to publish his academic work. As he often"
##
## $`73`
## [1] "explains, you never really finish your education - \"you can check out"
##
## $`74`
## [1] "any time you like, but you can never leave.\" Which is why he is doomed"
##
## $`75`
## [1] "to a lifetime in Hotel California. And he believes that, if this is as"
##
## $`76`
## [1] "bad as it gets, life is really pretty good."
##
## $`77`
## [1] ""
##
## $`78`
## [1] ""
##
## $`79`
## [1] ""
ctext = tm_map(ctext,removePunctuation)
print(lapply(ctext, as.character))## $`1`
## [1] "HTML"
##
## $`2`
## [1] "BODY backgroundhttpalgoscuedusanjivdasgraphicsback2gif"
##
## $`3`
## [1] ""
##
## $`4`
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
##
## $`5`
## [1] "Santa Clara Universitys Leavey School of Business He previously held"
##
## $`6`
## [1] "faculty appointments as Associate Professor at Harvard Business School"
##
## $`7`
## [1] "and UC Berkeley He holds postgraduate degrees in Finance MPhil and"
##
## $`8`
## [1] "PhD from New York University Computer Science MS from UC"
##
## $`9`
## [1] "Berkeley an MBA from the Indian Institute of Management Ahmedabad"
##
## $`10`
## [1] "BCom in Accounting and Economics University of Bombay Sydenham"
##
## $`11`
## [1] "College and is also a qualified Cost and Works Accountant He is a"
##
## $`12`
## [1] "senior editor of The Journal of Investment Management coeditor of"
##
## $`13`
## [1] "The Journal of Derivatives and The Journal of Financial Services"
##
## $`14`
## [1] "Research and Associate Editor of other academic journals Prior to"
##
## $`15`
## [1] "being an academic he worked in the derivatives business in the"
##
## $`16`
## [1] "AsiaPacific region as a VicePresident at Citibank His current"
##
## $`17`
## [1] "research interests include the modeling of default risk machine"
##
## $`18`
## [1] "learning social networks derivatives pricing models portfolio"
##
## $`19`
## [1] "theory and venture capital He has published over ninety articles in"
##
## $`20`
## [1] "academic journals and has won numerous awards for research and"
##
## $`21`
## [1] "teaching His recent book Derivatives Principles and Practice was"
##
## $`22`
## [1] "published in May 2010 He currently also serves as a Senior Fellow at"
##
## $`23`
## [1] "the FDIC Center for Financial Research"
##
## $`24`
## [1] ""
##
## $`25`
## [1] ""
##
## $`26`
## [1] "p BSanjiv Das A Short Academic Life HistoryB p"
##
## $`27`
## [1] ""
##
## $`28`
## [1] "After loafing and working in many parts of Asia but never really"
##
## $`29`
## [1] "growing up Sanjiv moved to New York to change the world hopefully"
##
## $`30`
## [1] "through research He graduated in 1994 with a PhD from NYU and"
##
## $`31`
## [1] "since then spent five years in Boston and now lives in San Jose"
##
## $`32`
## [1] "California Sanjiv loves animals places in the world where the"
##
## $`33`
## [1] "mountains meet the sea riding sport motorbikes reading gadgets"
##
## $`34`
## [1] "science fiction movies and writing cool software code When there is"
##
## $`35`
## [1] "time available from the excitement of daily life Sanjiv writes"
##
## $`36`
## [1] "academic papers which helps him relax Always the contrarian Sanjiv"
##
## $`37`
## [1] "thinks that New York City is the most calming place in the world"
##
## $`38`
## [1] "after California of course"
##
## $`39`
## [1] ""
##
## $`40`
## [1] "p"
##
## $`41`
## [1] ""
##
## $`42`
## [1] "Sanjiv is now a Professor of Finance at Santa Clara University He came"
##
## $`43`
## [1] "to SCU from Harvard Business School and spent a year at UC Berkeley In"
##
## $`44`
## [1] "his past life in the unreal world Sanjiv worked at Citibank NA in"
##
## $`45`
## [1] "the AsiaPacific region He takes great pleasure in merging his many"
##
## $`46`
## [1] "previous lives into his current existence which is incredibly confused"
##
## $`47`
## [1] "and diverse"
##
## $`48`
## [1] ""
##
## $`49`
## [1] "p"
##
## $`50`
## [1] ""
##
## $`51`
## [1] "Sanjivs research style is instilled with a distinct New York state of"
##
## $`52`
## [1] "mind it is chaotic diverse with minimal method to the madness He"
##
## $`53`
## [1] "has published articles on derivatives termstructure models mutual"
##
## $`54`
## [1] "funds the internet portfolio choice banking models credit risk and"
##
## $`55`
## [1] "has unpublished articles in many other areas Some years ago he took"
##
## $`56`
## [1] "time off to get another degree in computer science at Berkeley"
##
## $`57`
## [1] "confirming that an unchecked hobby can quickly become an obsession"
##
## $`58`
## [1] "There he learnt about the fascinating field of Randomized Algorithms"
##
## $`59`
## [1] "skills he now applies earnestly to his editorial work and other"
##
## $`60`
## [1] "pursuits many of which stem from being in the epicenter of Silicon"
##
## $`61`
## [1] "Valley"
##
## $`62`
## [1] ""
##
## $`63`
## [1] "p"
##
## $`64`
## [1] ""
##
## $`65`
## [1] "Coastal living did a lot to mold Sanjiv who needs to live near the"
##
## $`66`
## [1] "ocean The many walks in Greenwich village convinced him that there is"
##
## $`67`
## [1] "no such thing as a representative investor yet added many unique"
##
## $`68`
## [1] "features to his personal utility function He learnt that it is"
##
## $`69`
## [1] "important to open the academic door to the ivory tower and let the world"
##
## $`70`
## [1] "in Academia is a real challenge given that he has to reconcile many"
##
## $`71`
## [1] "more opinions than ideas He has been known to have turned down many"
##
## $`72`
## [1] "offers from Mad magazine to publish his academic work As he often"
##
## $`73`
## [1] "explains you never really finish your education you can check out"
##
## $`74`
## [1] "any time you like but you can never leave Which is why he is doomed"
##
## $`75`
## [1] "to a lifetime in Hotel California And he believes that if this is as"
##
## $`76`
## [1] "bad as it gets life is really pretty good"
##
## $`77`
## [1] ""
##
## $`78`
## [1] ""
##
## $`79`
## [1] ""
txt = NULL
for (j in 1:length(ctext)) {
txt = c(txt,ctext[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)## [1] "html body backgroundhttpalgoscuedusanjivdasgraphicsback2gif sanjiv das is the william and janice terry professor of finance at santa clara universitys leavey school of business he previously held faculty appointments as associate professor at harvard business school and uc berkeley he holds postgraduate degrees in finance mphil and phd from new york university computer science ms from uc berkeley an mba from the indian institute of management ahmedabad bcom in accounting and economics university of bombay sydenham college and is also a qualified cost and works accountant he is a senior editor of the journal of investment management coeditor of the journal of derivatives and the journal of financial services research and associate editor of other academic journals prior to being an academic he worked in the derivatives business in the asiapacific region as a vicepresident at citibank his current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital he has published over ninety articles in academic journals and has won numerous awards for research and teaching his recent book derivatives principles and practice was published in may 2010 he currently also serves as a senior fellow at the fdic center for financial research p bsanjiv das a short academic life historyb p after loafing and working in many parts of asia but never really growing up sanjiv moved to new york to change the world hopefully through research he graduated in 1994 with a phd from nyu and since then spent five years in boston and now lives in san jose california sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code when there is time available from the excitement of daily life sanjiv writes academic papers which helps him relax always the contrarian sanjiv thinks that new 
york city is the most calming place in the world after california of course p sanjiv is now a professor of finance at santa clara university he came to scu from harvard business school and spent a year at uc berkeley in his past life in the unreal world sanjiv worked at citibank na in the asiapacific region he takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse p sanjivs research style is instilled with a distinct new york state of mind it is chaotic diverse with minimal method to the madness he has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas some years ago he took time off to get another degree in computer science at berkeley confirming that an unchecked hobby can quickly become an obsession there he learnt about the fascinating field of randomized algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of silicon valley p coastal living did a lot to mold sanjiv who needs to live near the ocean the many walks in greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function he learnt that it is important to open the academic door to the ivory tower and let the world in academia is a real challenge given that he has to reconcile many more opinions than ideas he has been known to have turned down many offers from mad magazine to publish his academic work as he often explains you never really finish your education you can check out any time you like but you can never leave which is why he is doomed to a lifetime in hotel california and he believes that if this is as bad as it gets life is really pretty good "
An extremely important object in text analysis is the Term-Document Matrix (TDM). This allows us to store an entire library of text inside a single matrix, which may then be used for analysis as well as for searching documents. It forms the basis of search engines, topic analysis, and classification (e.g., spam filtering).
It is a table that provides the frequency count of every word (term) in each document. The number of rows in the TDM is equal to the number of unique terms, and the number of columns is equal to the number of documents.
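The construction is simple enough to sketch from scratch. A minimal Python version of a term-document matrix builder (illustrative only; tm's implementation handles tokenization and sparse storage far more carefully, and the toy documents here are made up):

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a TDM as a dict: term -> list of counts, one count per document."""
    counts = [Counter(d.lower().split()) for d in docs]
    terms = sorted(set(t for c in counts for t in c))
    return {t: [c[t] for c in counts] for t in terms}

docs = ["the cat sat", "the cat ate the fish"]
tdm = term_document_matrix(docs)
print(tdm["the"])   # "the" occurs once in doc 1 and twice in doc 2: [1, 2]
```

Each row (term) holds one frequency per document, which is exactly the shape reported by tm's TermDocumentMatrix: rows are unique terms, columns are documents.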
#TERM-DOCUMENT MATRIX
tdm = TermDocumentMatrix(ctext,control=list(minWordLength=1))
print(tdm)## <<TermDocumentMatrix (terms: 317, documents: 79)>>
## Non-/sparse entries: 497/24546
## Sparsity : 98%
## Maximal term length: 49
## Weighting : term frequency (tf)
inspect(tdm[10:20,11:18])## <<TermDocumentMatrix (terms: 11, documents: 8)>>
## Non-/sparse entries: 4/84
## Sparsity : 95%
## Maximal term length: 12
## Weighting : term frequency (tf)
##
## Docs
## Terms 11 12 13 14 15 16 17 18
## ago 0 0 0 0 0 0 0 0
## ahmedabad 0 0 0 0 0 0 0 0
## algorithms 0 0 0 0 0 0 0 0
## also 1 0 0 0 0 0 0 0
## always 0 0 0 0 0 0 0 0
## and 2 0 1 1 0 0 0 0
## animals 0 0 0 0 0 0 0 0
## another 0 0 0 0 0 0 0 0
## any 0 0 0 0 0 0 0 0
## applies 0 0 0 0 0 0 0 0
## appointments 0 0 0 0 0 0 0 0
out = findFreqTerms(tdm,lowfreq=5)
print(out)## [1] "academic" "and" "derivatives" "from" "has"
## [6] "his" "many" "research" "sanjiv" "that"
## [11] "the" "world"
This is a weighting scheme designed to sharpen the importance of rare words in a document, relative to the frequency of those words in the corpus. It is based on simple calculations, and even though it does not have strong theoretical foundations, it is very useful in practice. TF-IDF measures the importance of a word \(w\) in a document \(d\) within a corpus \(C\). It is therefore a function of all three, written TF-IDF\((w,d,C)\), and is the product of term frequency (TF) and inverse document frequency (IDF).
The frequency of a word in a document is defined as \[ f(w,d) = \frac{\#w \in d}{|d|} \] where \(|d|\) is the number of words in the document. We usually normalize word frequency so that \[ TF(w,d) = \ln[f(w,d)] \] This is log normalization. Another form of normalization is known as double normalization and is as follows: \[ TF(w,d) = \frac{1}{2} + \frac{1}{2} \frac{f(w,d)}{\max_{w \in d} f(w,d)} \] Note that normalization is not necessary, but it tends to help shrink the difference between counts of words.
Inverse document frequency is as follows: \[ IDF(w,C) = \ln\left[ \frac{|C|}{|d_{w \in d}|} \right] \] That is, we compute the ratio of the number of documents in the corpus \(C\) divided by the number of documents with word \(w\) in the corpus.
Finally, we have the weighting score for a given word \(w\) in document \(d\) in corpus \(C\): \[ \mbox{TF-IDF}(w,d,C) = TF(w,d) \times IDF(w,C) \]
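These definitions translate directly into code. A short Python sketch of the double-normalized TF and the logarithmic IDF above, on a made-up three-document corpus (the corpus and word choices are illustrative only):

```python
import math
from collections import Counter

def tf_double_norm(word, doc):
    """Double normalization: 0.5 + 0.5 * f(w,d) / max_w f(w,d)."""
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[word] / max(counts.values())

def idf(word, corpus):
    """ln(|C| / number of documents containing the word)."""
    n_docs_with_word = sum(1 for d in corpus if word in d)
    return math.log(len(corpus) / n_docs_with_word)

corpus = [["credit", "risk", "models"],
          ["derivatives", "pricing", "models"],
          ["portfolio", "theory"]]

# "credit" appears once in a doc of singletons, so TF = 1.0;
# it appears in 1 of 3 docs, so IDF = ln(3)
score = tf_double_norm("credit", corpus[0]) * idf("credit", corpus)
```

Note how "models", which appears in two of the three documents, earns a lower IDF of \(\ln(3/2)\) than the rarer "credit".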
We illustrate this with an application to the previously computed term-document matrix.
tdm_mat = as.matrix(tdm) #Convert tdm into a matrix
print(dim(tdm_mat))## [1] 317 79
nw = dim(tdm_mat)[1]
nd = dim(tdm_mat)[2]
doc = 13 #Choose document
word = "derivatives" #Choose word
#COMPUTE TF
f = NULL
for (w in row.names(tdm_mat)) {
f = c(f,tdm_mat[w,doc]/sum(tdm_mat[,doc]))
}
fw = tdm_mat[word,doc]/sum(tdm_mat[,doc])
TF = 0.5 + 0.5*fw/max(f)
print(TF)## [1] 0.75
#COMPUTE IDF
nw = length(which(tdm_mat[word,]>0))
print(nw)## [1] 5
IDF = nd/nw #the raw document ratio; the definition above takes the log of this
print(IDF)## [1] 15.8
#COMPUTE TF-IDF
TF_IDF = TF*IDF
print(TF_IDF) #With normalization## [1] 11.85
print(fw*IDF) #Without normalization## [1] 1.975
We can write this code into a function and work out the TF-IDF for all words. Then these word weights may be used in further text analysis.
We may also directly use the weightTfIdf function in the tm package. This undertakes the following computation:
Term frequency \({\it tf}_{i,j}\) counts the number of occurrences \(n_{i,j}\) of a term \(t_i\) in a document \(d_j\). In the case of normalization, the term frequency \(\mathit{tf}_{i,j}\) is divided by \(\sum_k n_{k,j}\).
Inverse document frequency for a term \(t_i\) is defined as \(\mathit{idf}_i = \log_2 \frac{|D|}{|{d_{t_i \in d}}|}\), where \(|D|\) denotes the total number of documents and \(|{d_{t_i \in d}}|\) is the number of documents in which the term \(t_i\) appears.
Term frequency - inverse document frequency is now defined as \(\mathit{tf}_{i,j} \cdot \mathit{idf}_i\).
Example:
library(tm)
textarray = c("Free software comes with ABSOLUTELY NO certain WARRANTY","You are welcome to redistribute free software under certain conditions","Natural language support for software in an English locale","A collaborative project with many contributors")
textcorpus = Corpus(VectorSource(textarray))
m = TermDocumentMatrix(textcorpus)
print(as.matrix(m))## Docs
## Terms 1 2 3 4
## absolutely 1 0 0 0
## are 0 1 0 0
## certain 1 1 0 0
## collaborative 0 0 0 1
## comes 1 0 0 0
## conditions 0 1 0 0
## contributors 0 0 0 1
## english 0 0 1 0
## for 0 0 1 0
## free 1 1 0 0
## language 0 0 1 0
## locale 0 0 1 0
## many 0 0 0 1
## natural 0 0 1 0
## project 0 0 0 1
## redistribute 0 1 0 0
## software 1 1 1 0
## support 0 0 1 0
## under 0 1 0 0
## warranty 1 0 0 0
## welcome 0 1 0 0
## with 1 0 0 1
## you 0 1 0 0
print(as.matrix(weightTfIdf(m)))## Docs
## Terms 1 2 3 4
## absolutely 0.28571429 0.00000000 0.00000000 0.0
## are 0.00000000 0.22222222 0.00000000 0.0
## certain 0.14285714 0.11111111 0.00000000 0.0
## collaborative 0.00000000 0.00000000 0.00000000 0.4
## comes 0.28571429 0.00000000 0.00000000 0.0
## conditions 0.00000000 0.22222222 0.00000000 0.0
## contributors 0.00000000 0.00000000 0.00000000 0.4
## english 0.00000000 0.00000000 0.28571429 0.0
## for 0.00000000 0.00000000 0.28571429 0.0
## free 0.14285714 0.11111111 0.00000000 0.0
## language 0.00000000 0.00000000 0.28571429 0.0
## locale 0.00000000 0.00000000 0.28571429 0.0
## many 0.00000000 0.00000000 0.00000000 0.4
## natural 0.00000000 0.00000000 0.28571429 0.0
## project 0.00000000 0.00000000 0.00000000 0.4
## redistribute 0.00000000 0.22222222 0.00000000 0.0
## software 0.05929107 0.04611528 0.05929107 0.0
## support 0.00000000 0.00000000 0.28571429 0.0
## under 0.00000000 0.22222222 0.00000000 0.0
## warranty 0.28571429 0.00000000 0.00000000 0.0
## welcome 0.00000000 0.22222222 0.00000000 0.0
## with 0.14285714 0.00000000 0.00000000 0.2
## you 0.00000000 0.22222222 0.00000000 0.0
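The entries above can be reproduced by hand from the definitions just given. A quick Python check for two terms in document 1, which has seven terms after tokenization ("absolutely" appears in one of the four documents, "software" in three):

```python
import math

def tf_idf(n_ij, doc_len, n_docs, docs_with_term):
    """Normalized term frequency times log2 inverse document frequency."""
    return (n_ij / doc_len) * math.log2(n_docs / docs_with_term)

print(round(tf_idf(1, 7, 4, 1), 8))  # "absolutely": 0.28571429
print(round(tf_idf(1, 7, 4, 3), 8))  # "software":   0.05929107
```

Both values match the first column of the weightTfIdf output: the rarer term gets a far higher weight despite having the same raw count.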
In this segment we learn some popular functions on text that are used in practice. One of the first things we often want to do is find similar texts or sentences (think of web search as one application). Since documents are vectors in the TDM, we may want to find the closest vectors or compute the distance between vectors. A standard measure is the cosine of the angle between two vectors \(A\) and \(B\):
\[ cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||} \]
where \(||A|| = \sqrt{A \cdot A}\) is the norm of \(A\), i.e., the square root of the dot product of \(A\) with itself. This gives the cosine of the angle between the two vectors, which is zero for orthogonal vectors and one for vectors pointing in the same direction.
#COSINE DISTANCE OR SIMILARITY
A = as.matrix(c(0,3,4,1,7,0,1))
B = as.matrix(c(0,4,3,0,6,1,1))
cos = t(A) %*% B / (sqrt(t(A)%*%A) * sqrt(t(B)%*%B))
print(cos)## [,1]
## [1,] 0.9682728
library(lsa)## Loading required package: SnowballC
#THE COSINE FUNCTION IN LSA ONLY TAKES ARRAYS
A = c(0,3,4,1,7,0,1)
B = c(0,4,3,0,6,1,1)
print(cosine(A,B))## [,1]
## [1,] 0.9682728
The ANLP package has a few additional functions that make the preceding ideas more streamlined to implement. First, let's read in the usual text.
library(ANLP)## Warning: package 'ANLP' was built under R version 3.2.5
## Loading required package: qdap
## Warning: package 'qdap' was built under R version 3.2.5
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
##
## ngrams
## The following object is masked from 'package:stringr':
##
## %>%
## The following object is masked from 'package:base':
##
## Filter
## Loading required package: RWeka
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:qdap':
##
## %>%
## The following object is masked from 'package:qdapTools':
##
## id
## The following objects are masked from 'package:qdapRegex':
##
## escape, explain
## The following object is masked from 'package:lsa':
##
## query
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: replacing previous import by 'tm::TermDocumentMatrix' when loading
## 'ANLP'
download.file("http://srdas.github.io/bio-candid.html",destfile = "text")
text = readTextFile("text","UTF-8")
ctext = cleanTextData(text) #Creates a text corpus
The cleanTextData function removes non-English characters, numbers, white space, brackets, and punctuation. It also handles abbreviations and contractions, and converts the entire text to lower case.
We now build TDMs for unigrams, bigrams, and trigrams, and then combine them into one list for word prediction.
g1 = generateTDM(ctext,1)
g2 = generateTDM(ctext,2)
g3 = generateTDM(ctext,3)
gmodel = list(g1,g2,g3)
Next, we use the back-off algorithm to predict the next word in a sequence.
print(predict_Backoff("you never",gmodel))## [1] "leave"
print(predict_Backoff("life is",gmodel))## [1] "the"
print(predict_Backoff("been known",gmodel))## [1] "to"
print(predict_Backoff("needs to",gmodel))## [1] "his"
print(predict_Backoff("worked at",gmodel))## [1] "citibank"
print(predict_Backoff("being an",gmodel))## [1] "unchecked"
print(predict_Backoff("publish",gmodel))## [1] "in"
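The back-off idea is straightforward: look the context up in the highest-order n-gram table first, and fall back to shorter and shorter contexts until a match is found. A minimal Python sketch of this logic (illustrative; the internals of ANLP's predict_Backoff may differ):

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, max_n=3):
    """Map each context (tuple of up to max_n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n):               # context lengths 1 .. max_n-1
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            tables[context][tokens[i + n]] += 1
    return tables

def predict_backoff(phrase, tables):
    """Try the longest available context first, backing off one word at a time."""
    words = tuple(phrase.lower().split())
    for k in range(len(words), 0, -1):
        context = words[-k:]
        if context in tables:
            return tables[context].most_common(1)[0][0]
    return None

tokens = "you can check out any time you like but you can never leave".split()
tables = build_ngram_tables(tokens)
print(predict_backoff("never", tables))  # "leave" is the only continuation seen
```

This mirrors the predictions above: "you never ... leave" comes out of the training text because the longest matching context dominates.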
Wordclouds are an interesting way to represent text. They give an instant visual summary. The wordcloud package in R may be used to create your own wordclouds.
#MAKE A WORDCLOUD
library(wordcloud)
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)
#REMOVE STOPWORDS, NUMBERS
ctext1 = tm_map(ctext,removeWords,stopwords("english"))
ctext1 = tm_map(ctext1, removeNumbers)
tdm = TermDocumentMatrix(ctext1,control=list(minWordLength=1))
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)
Stemming is the procedure by which a word is reduced to its root or stem. This is done so that words from one stem are treated as the same word, rather than as separate words. For example, we do not want "eaten" and "eating" to be treated as different words.
#STEMMING
ctext2 = tm_map(ctext,removeWords,stopwords("english"))
ctext2 = tm_map(ctext2, stemDocument)
print(lapply(ctext2, as.character))## $`1`
## [1] ""
## [2] ""
## [3] ""
## [4] "sanjiv das william janic terri professor financ"
## [5] "santa clara univers leavey school busi previous held"
## [6] "faculti appoint associ professor harvard busi school"
## [7] " uc berkeley hold postgradu degre financ mphil"
## [8] "phd new york univers comput scienc ms uc"
## [9] "berkeley mba indian institut manag ahmedabad"
## [10] "bcom account econom univers bombay sydenham"
## [11] "colleg also qualifi cost work account "
## [12] "senior editor journal invest manag coeditor"
## [13] " journal deriv journal financi servic"
## [14] "research associ editor academ journal prior"
## [15] " academ work deriv busi "
## [16] "asiapacif region vicepresid citibank current"
## [17] "research interest includ model default risk machin"
## [18] "learn social network deriv price model portfolio"
## [19] "theori ventur capit publish nineti articl"
## [20] "academ journal won numer award research"
## [21] "teach recent book deriv principl practic"
## [22] "publish may current also serv senior fellow"
## [23] " fdic center financi research"
## [24] ""
## [25] ""
## [26] "sanjiv das short academ life histori"
## [27] ""
## [28] " loaf work mani part asia never realli"
## [29] "grow sanjiv move new york chang world hope"
## [30] " research graduat phd nyu"
## [31] "sinc spent five year boston now live san jose"
## [32] "california sanjiv love anim place world "
## [33] "mountain meet sea ride sport motorbik read gadget"
## [34] "scienc fiction movi write cool softwar code "
## [35] "time avail excit daili life sanjiv write"
## [36] "academ paper help relax alway contrarian sanjiv"
## [37] "think new york citi calm place world"
## [38] " california cours"
## [39] ""
## [40] ""
## [41] ""
## [42] "sanjiv now professor financ santa clara univers came"
## [43] " scu harvard busi school spent year uc berkeley"
## [44] " past life unreal world sanjiv work citibank na"
## [45] " asiapacif region take great pleasur merg mani"
## [46] "previous live current exist incred confus"
## [47] " divers"
## [48] ""
## [49] ""
## [50] ""
## [51] "sanjiv research style instil distinct new york state"
## [52] "mind chaotic divers minim method mad"
## [53] " publish articl deriv termstructur model mutual"
## [54] "fund internet portfolio choic bank model credit risk"
## [55] " unpublish articl mani area year ago took"
## [56] "time get anoth degre comput scienc berkeley"
## [57] "confirm uncheck hobbi can quick becom obsess"
## [58] " learnt fascin field random algorithm"
## [59] "skill now appli earnest editori work "
## [60] "pursuit mani stem epicent silicon"
## [61] "valley"
## [62] ""
## [63] ""
## [64] ""
## [65] "coastal live lot mold sanjiv need live near"
## [66] "ocean mani walk greenwich villag convinc "
## [67] " thing repres investor yet ad mani uniqu"
## [68] "featur person util function learnt "
## [69] "import open academ door ivori tower let world"
## [70] " academia real challeng given reconcil mani"
## [71] " opinion idea known turn mani"
## [72] "offer mad magazin publish academ work often"
## [73] "explain never realli finish educ can check"
## [74] " time like can never leav doom"
## [75] " lifetim hotel california believ "
## [76] "bad get life realli pretti good"
## [77] ""
## [78] ""
## [79] ""
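The stemmer tm uses (Porter, via the SnowballC package) applies an ordered sequence of suffix-stripping rules. A toy Python version conveys the flavor; the suffix list here is hypothetical and far cruder than the real algorithm:

```python
def toy_stem(word):
    """Strip one of a few common English suffixes, longest first (toy rules only)."""
    for suffix in ("ational", "ization", "ing", "ness", "ed", "es", "en", "s"):
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["eating", "eaten", "derivatives", "models"]])
# ['eat', 'eat', 'derivativ', 'model']
```

Even these crude rules map "eating" and "eaten" to the same stem "eat", which is the whole point: counts accumulate on one token rather than being split across inflected forms.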
Regular expressions are a syntax for modifying strings in an efficient manner. They are complicated but extremely effective. Here we will illustrate with a few examples, but you are encouraged to explore more on your own, as the variations are endless. What you need to do will depend on the application at hand, and with some experience you will become better at using regular expressions; the initial use will, however, be somewhat confusing.
We start with a simple example of a text array where we wish to replace the string "data" with a blank, i.e., we eliminate this string from the text we have.
library(tm)
#Create a text array
text = c("Doc1 is datavision","Doc2 is datatable","Doc3 is data","Doc4 is nodata","Doc5 is simpler")
print(text)## [1] "Doc1 is datavision" "Doc2 is datatable" "Doc3 is data"
## [4] "Doc4 is nodata" "Doc5 is simpler"
#Remove all strings with the chosen text for all docs
print(gsub("data","",text))## [1] "Doc1 is vision" "Doc2 is table" "Doc3 is " "Doc4 is no"
## [5] "Doc5 is simpler"
#Remove "data" and everything that follows it in each string
print(gsub("*data.*","",text))## [1] "Doc1 is " "Doc2 is " "Doc3 is " "Doc4 is no"
## [5] "Doc5 is simpler"
#Remove "data" along with the single character preceding it
print(gsub("*.data*","",text))## [1] "Doc1 isvision" "Doc2 istable" "Doc3 is" "Doc4 is n"
## [5] "Doc5 is simpler"
#Remove the character before "data" and everything from "data" onward
print(gsub("*.data.*","",text))## [1] "Doc1 is" "Doc2 is" "Doc3 is" "Doc4 is n"
## [5] "Doc5 is simpler"
We now explore some more complex regular expressions. One common case is searching for special types of strings, like telephone numbers. Suppose we have a text array that may contain telephone numbers in different formats; we can use a single grep command to extract these numbers. Here is some code to illustrate this.
#Create an array with some strings which may also contain telephone numbers as strings.
x = c("234-5678","234 5678","2345678","1234567890","0123456789","abc 234-5678","234 5678 def","xx 2345678","abc1234567890def")
#Now use grep to find which elements of the array contain telephone numbers
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]",x)
print(idx)## [1] 1 2 4 6 7 9
print(x[idx])## [1] "234-5678" "234 5678" "1234567890"
## [4] "abc 234-5678" "234 5678 def" "abc1234567890def"
#We can shorten this as follows
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}",x)
print(idx)## [1] 1 2 4 6 7 9
print(x[idx])## [1] "234-5678" "234 5678" "1234567890"
## [4] "abc 234-5678" "234 5678 def" "abc1234567890def"
#What if we want to extract only the phone number and drop the rest of the text?
pattern = "[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}"
print(regmatches(x, gregexpr(pattern,x)))## [[1]]
## [1] "234-5678"
##
## [[2]]
## [1] "234 5678"
##
## [[3]]
## character(0)
##
## [[4]]
## [1] "1234567890"
##
## [[5]]
## character(0)
##
## [[6]]
## [1] "234-5678"
##
## [[7]]
## [1] "234 5678"
##
## [[8]]
## character(0)
##
## [[9]]
## [1] "1234567890"
#Or use the stringr package, which is a lot better
library(stringr)
str_extract(x,pattern)## [1] "234-5678" "234 5678" NA "1234567890" NA
## [6] "234-5678" "234 5678" NA "1234567890"
Now we use grep to extract emails by looking for the “@” sign in the text string. We would proceed as in the following example.
x = c("sanjiv das","srdas@scu.edu","SCU","data@science.edu")
print(grep("\\@",x))## [1] 2 4
print(x[grep("\\@",x)])## [1] "srdas@scu.edu" "data@science.edu"
You get the idea. Using the functions gsub, grep, regmatches, and gregexpr, you can handle most of the fancy string manipulation that is needed.
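The phone-number pattern carries over almost unchanged to other regex engines. A Python sketch of the same extraction, with the POSIX class [[:digit:]] written as \d:

```python
import re

# Same alternation as the R pattern: ddd-dddd, ddd dddd, or a 10-digit number
pattern = r"\d{3}-\d{4}|\d{3} \d{4}|[1-9][0-9]{9}"

x = ["234-5678", "2345678", "abc1234567890def"]
print([re.findall(pattern, s) for s in x])
# [['234-5678'], [], ['1234567890']]
```

As with regmatches/gregexpr in R, findall returns only the matched substrings and drops the surrounding text.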
The rvest package, written by Hadley Wickham, is a powerful tool for extracting text from web pages. The package provides wrappers around the xml2 and httr packages to make it easy to download, and then manipulate, HTML and XML. The package is best illustrated with some simple examples.
SelectorGadget is a useful tool to use in conjunction with the rvest package. It allows you to find the HTML tag in a web page that you need to pass to the program to parse the page element you are interested in. Download it from: http://selectorgadget.com/
Here is some code to read in the slashdot web page and gather the stories currently on their headlines.
library(rvest)## Warning: package 'rvest' was built under R version 3.2.5
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.2.5
##
## Attaching package: 'rvest'
## The following object is masked from 'package:qdap':
##
## %>%
## The following object is masked from 'package:XML':
##
## xml
url = "https://slashdot.org/"
doc.html = read_html(url)
text = doc.html %>% html_nodes(".story") %>% html_text()
text = gsub("[\t\n]","",text)
#text = paste(text, collapse=" ")
print(text[1:20])## [1] " Ask Slashdot: What's The Best Geeky Gift For Children?1"
## [2] " Why Apple Just Invested in Wind Turbines In China (cnn.com) 45"
## [3] " Struggling Workers Found Sleeping In Tents Behind Amazon's Warehouse (thecourier.co.uk) 147"
## [4] " Analysts Tout 'State of The Developer' Survey By Awarding RPG Characters (amazon.com) 20"
## [5] " Inside the NYPD's Attempt To Build Community Trust Through Twitter (backchannel.com) 35"
## [6] " Fedora-based Linux Distro Korora (Version 25) Now Available For Download (betanews.com) 21"
## [7] " FBI Relents, Confirms Previously-Denied UFO Investigation (muckrock.com) 58"
## [8] " A 'Turkish Hacker' Is Giving Out Prizes For DDoS Attacks (csoonline.com) 29"
## [9] " The DEA Has Been Secretly Paying Transport Employees To Search Travelers' Bags (economist.com) 118"
## [10] " 5-Year-Old Critical Linux Vulnerability Patched (threatpost.com) 53"
## [11] " Uber Asks Everyone To Stop Making It The New Tinder (sfgate.com) 122"
## [12] " New Bug In Windows 10 Anniversary Update Brings Wi-Fi Disconnects (infoworld.com) 133"
## [13] " US Think Tank Wants To Regulate The Design of IoT Devices For Security Purposes (theregister.co.uk) 78"
## [14] " Autonomous Shuttle Brakes For Squirrels, Skateboarders, and Texting Students (ieee.org) 68"
## [15] " 'Star In a Jar' Fusion Reactor Works, Promises Infinite Energy (space.com) 334"
## [16] NA
## [17] NA
## [18] NA
## [19] NA
## [20] NA
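Asking to print more entries than there are matching stories yields the NA padding seen above; a quick cleanup (shown on a toy vector) drops those entries:

```r
# Drop NA entries left over from indexing past the number of stories.
text = c("story one", "story two", NA, NA, NA)
text = text[!is.na(text)]
print(text)  # "story one" "story two"
```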
Sometimes we need to read a table embedded in a web page; this too is a simple exercise, also undertaken with rvest.
library(rvest)
url = "http://finance.yahoo.com/q?uhb=uhb2&fr=uh3_finance_vert_gs&type=2button&s=IBM"
doc.html = read_html(url)
table = doc.html %>% html_nodes("table") %>% html_table()
print(table)## [[1]]
## X1 X2
## 1 NA Search
##
## [[2]]
## X1 X2
## 1 Previous Close 165.36
## 2 Open 166.00
## 3 Bid 166.42 x 300
## 4 Ask 166.59 x 100
## 5 Day's Range 164.60 - 166.72
## 6 52 Week Range 116.90 - 166.72
## 7 Volume 3,146,930
## 8 Avg. Volume 3,585,104
##
## [[3]]
## X1 X2
## 1 Market Cap 158.34B
## 2 Beta 0.91
## 3 PE Ratio (TTM) 13.57
## 4 EPS (TTM) N/A
## 5 Earnings Date N/A
## 6 Dividend & Yield 5.60 (3.49%)
## 7 Ex-Dividend Date N/A
## 8 1y Target Est N/A
Note that this code extracted all the web tables in the Yahoo! Finance page and returned each one as a list item.
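Since each table arrives as a list element with character columns, a typical next step is to pick out the table we want and coerce its quote column to numeric. A hedged sketch, mocking the scraped structure with a small invented data frame rather than re-scraping the page:

```r
# Stand-in for the html_table() result: a list whose second element
# is the quote table, with prices stored as character strings.
table = list(data.frame(X1 = NA, X2 = "Search"),
             data.frame(X1 = c("Previous Close", "Open"),
                        X2 = c("165.36", "166.00"),
                        stringsAsFactors = FALSE))
quotes = table[[2]]                # the quote table is the second element
quotes$X2 = as.numeric(quotes$X2)  # convert prices from character to numeric
print(quotes$X2)  # 165.36 166.00
```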
Here we turn to some Russian-language sites, from which we extract forex quotes and store them in a data frame.
library(rvest)
url1 <- "http://finance.i.ua/market/kiev/?type=1" #Buy USD
url2 <- "http://finance.i.ua/market/kiev/?type=2" #Sell USD
doc1.html = read_html(url1)
table1 = doc1.html %>% html_nodes("table") %>% html_table()
result1 = table1[[1]]
print(head(result1))## X1 X2 X3 X4
## 1 Время Курс Сумма Телефон
## 2 01:48 26.8801 114700 $ +38 093 \n Показать
## 3 06:22 26.9 100 $ +38 093 \n Показать
## 4 13:28 26.9 5000 $ +38 093 \n Показать
## 5 13:29 26.901 37000 $ +38 093 \n Показать
## 6 13:29 28.65 10000 € +38 093 \n Показать
## X5
## 1 Район
## 2 Ленинградская площадь Обменный пункт
## 3 м Тараса Шеченка,
## 4 Центр Л. Тостого Д. Спорта Олимпийский
## 5 Обмен Валют Ленинградка Харьковское
## 6 подол
## X6
## 1 Комментарий
## 2 От 1000 дол. Крупная гривна. Звоните с 7. 00. Ярослав
## 3 Нового образца
## 4 Можно частями, могу подъехать от 500 или за €вро
## 5 От 3т. 500 грн купюры. Звоните. Ярослав
## 6 можно частями
doc2.html = read_html(url2)
table2 = doc2.html %>% html_nodes("table") %>% html_table()
result2 = table2[[1]]
print(head(result2))## X1 X2 X3 X4
## 1 Время Курс Сумма Телефон
## 2 01:36 0.426 970000 \u20bd +38 093 \n Показать
## 3 01:47 26.9799 147000 $ +38 093 \n Показать
## 4 01:50 28.7699 27000 € +38 093 \n Показать
## 5 14:48 27.0 3500 $ +38 050 \n Показать
## 6 13:29 26.97 55000 $ +38 096 \n Показать
## X5
## 1 Район
## 2 Ленинградская площадь Обмен Валют
## 3 Ленинградская площадь Обменный пункт
## 4 Ленинградская площадь Обменный пунк
## 5 Голосеевский
## 6 Еврогазбанк петровка
## X6
## 1 Комментарий
## 2 От 200т рублей. 5000 купюры. Или за доллар 63. 15. Звоните с 6. 00. Ярослав
## 3 От 100 дол. Без комиссий. Нового образца. Звоните с 7. 00. Ярослав
## 4 От 3т евро. Разные купюры. Звоните с 7. 00. Ярослав
## 5 можно частями, обмен валют, м. Лыбедская, Антоновича (Горького)
## 6 можно частями
#Clicking the "Show More" button on the Google Scholar page
library(RCurl)
library(RSelenium)
library(rvest)
library(stringr)
library(igraph)
checkForServer()
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444
, browserName = "firefox"
)
remDr$open()
remDr$getStatus()
remDr$navigate("http://scholar.google.com")
webElem <- remDr$findElement(using = 'css selector', "input#gs_hp_tsi")
webElem$sendKeysToElement(list("Sanjiv Das", "\uE007"))
link <- webElem$getCurrentUrl()
page <- read_html(as.character(link))
citations <- page %>% html_nodes (".gs_rt2")
matched <- str_match_all(citations, "<a href=\"(.*?)\"")
scholarurl <- paste("https://scholar.google.com", matched[[1]][,2], sep="")
page <- read_html(as.character(scholarurl))
remDr$navigate(as.character(scholarurl))
authorlist <- page %>% html_nodes(css=".gs_gray") %>% html_text() # Selecting fields after CSS selector .gs_gray
authorlist <- as.data.frame(authorlist)
odd_index <- seq(1,nrow(authorlist),2) #Sorting data by even/odd indexes to form a table.
even_index <- seq (2,nrow(authorlist),2)
authornames <- data.frame(x=authorlist[odd_index,1])
papernames <- data.frame(x=authorlist[even_index,1])
pubmatrix <- cbind(authornames,papernames)
# Building the view all link on scholar page.
a=str_split(matched, "user=")
x <- substring(a[[1]][2], 1,12)
y<- paste("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=", x, sep="")
remDr$navigate(y)
#Reading view all page to get author list:
page <- read_html(as.character(y))
z <- page %>% html_nodes (".gsc_1usr_name")
x <-lapply(z,str_extract,">[A-Z]+[a-z]+ .+<")
x<-lapply(x,str_replace, ">","")
x<-lapply(x,str_replace, "<","")
# Graph function:
bsk <- as.matrix(cbind("SR Das", unlist(x)))
bsk.network<-graph.data.frame(bsk, directed=F)
plot(bsk.network)
We now turn to getting text from the web using the APIs of services like Twitter, Facebook, etc. You will need to open a free developer account on each site, and you will also need the special R package for each different source.
First create a Twitter developer account to get the required credentials for accessing the API. See: https://dev.twitter.com/
The Twitter API needs a lot of handshaking…
##TWITTER EXTRACTOR
library(twitteR)
library(ROAuth)
library(RCurl)
download.file(url="https://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
#certificate file based on Privacy Enhanced Mail (PEM) protocol: https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail
cKey = "rIXqaZNxJ4A8YB6jhJsXEh9HX" #These are my keys and won't work for you
cSecret = "KC5kRgsJVrBV6vNIndrF69tcHFfzwcqpQOzLMO80Cu3dVFpcZb" #use your own secret
reqURL = "https://api.twitter.com/oauth/request_token"
accURL = "https://api.twitter.com/oauth/access_token"
authURL = "https://api.twitter.com/oauth/authorize"
#NOW SUBMIT YOUR CODES AND ASK FOR CREDENTIALS
cred = OAuthFactory$new(consumerKey=cKey, consumerSecret=cSecret,requestURL=reqURL, accessURL=accURL,authURL=authURL)
cred$handshake(cainfo="cacert.pem") #Asks for token
#Test and save credentials
#registerTwitterOAuth(cred)
#save(list="cred",file="twitteR_credentials")
#FIRST PHASE DONE
##USE httr, SECOND PHASE
library(httr)
#options(httr_oauth_cache=T)
accToken = "18666236-DmDE1wwbpvPbDcw9kwt9yThGeyYhjfpVVywrHuhOQ"
accTokenSecret = "cttbpxpTtqJn7wrCP36I59omNI5GQHXXgV41sKwUgc"
setup_twitter_oauth(cKey,cSecret,accToken,accTokenSecret) #At prompt, type 1
This more direct code chunk does the handshaking better and faster than the preceding one.
library(stringr)
library(twitteR)##
## Attaching package: 'twitteR'
## The following objects are masked from 'package:dplyr':
##
## id, location
## The following object is masked from 'package:qdapTools':
##
## id
library(ROAuth)
library(RCurl)## Warning: package 'RCurl' was built under R version 3.2.4
## Loading required package: bitops
cKey = "rIXqaZNxJ4A8YB6jhJsXEh9HX"
cSecret = "KC5kRgsJVrBV6vNIndrF69tcHFfzwcqpQOzLMO80Cu3dVFpcZb"
accToken = "18666236-DmDE1wwbpvPbDcw9kwt9yThGeyYhjfpVVywrHuhOQ"
accTokenSecret = "cttbpxpTtqJn7wrCP36I59omNI5GQHXXgV41sKwUgc"
setup_twitter_oauth(consumer_key = cKey,
consumer_secret = cSecret,
access_token = accToken,
access_secret = accTokenSecret)## [1] "Using direct authentication"
This completes the handshaking with Twitter. Now we can access tweets using the functions in the twitteR package.
#EXAMPLE 1
s = searchTwitter("#GOOG") #This is a list
s
#CONVERT TWITTER LIST TO TEXT ARRAY (see documentation in twitteR package)
twts = twListToDF(s) #This gives a dataframe with the tweets
names(twts)
twts_array = twts$text
print(twts$retweetCount)
twts_array
#EXAMPLE 2
s = getUser("srdas")
fr = s$getFriends()
print(length(fr))
print(fr[1:10])
s_tweets = userTimeline("srdas",n=20)
print(s_tweets)
getCurRateLimitInfo(c("srdas"))
This assumes you have a working Twitter account and have already connected R to it using the twitteR package.
library(streamR)
filterStream(file.name = "tweets.json", # Save tweets in a json file
track = "useR_Stanford" , # Collect tweets with useR_Stanford over 60 seconds. Can use twitter handles or keywords.
language = "en",
timeout = 30, # Keep connection alive for 60 seconds
oauth = cred) # Use OAuth credentials
tweets.df <- parseTweets("tweets.json", simplify = FALSE)  # Parse the json file into the data frame tweets.df; simplify = FALSE keeps lat/lon information
filterStream(file.name = "tweets.json",  # Save tweets in a json file
             track = "3497513953",  # Collect tweets from the useR2016 feed; must use the twitter ID of the user
             language = "en",
             timeout = 30,  # Keep the connection alive for 30 seconds
             oauth = cred)  # Use my_oauth file as the OAuth credentials
tweets.df <- parseTweets("tweets.json", simplify = FALSE)
userStream(file.name = "my_timeline.json", with = "followings", tweets = 10, oauth = cred)
Now we move on to using Facebook, which is a little less trouble than Twitter. The results may also be used for creating interesting networks.
##FACEBOOK EXTRACTOR
library(Rfacebook)
library(SnowballC)
library(Rook)
library(ROAuth)
app_id = "847737771920076" # USE YOUR OWN IDs
app_secret = "a120a2ec908d9e00fcd3c619cad7d043"
fb_oauth = fbOAuth(app_id,app_secret,extended_permissions=TRUE)
#save(fb_oauth,file="fb_oauth")
#DIRECT LOAD
load("fb_oauth")
##EXAMPLES
bbn = getUsers("bloombergnews",token=fb_oauth)
print(bbn)
page = getPage(page="bloombergnews",token=fb_oauth,n=20)
print(dim(page))
print(head(page))
print(names(page))
print(page$message)
print(page$message[11])
First we examine the protocol for connecting to the Yelp API. This assumes you have opened a Yelp developer account.
###CODE to connect to YELP.
consumerKey = "z6w-Or6HSyKbdUTmV9lbOA"
consumerSecret = "ImUufP3yU9FmNWWx54NUbNEBcj8"
token = "mBzEBjhYIGgJZnmtTHLVdQ-0cyfFVRGu"
token_secret = "v0FGCL0TS_dFDWFwH3HptDZhiLE"
require(httr)
require(httpuv)
require(jsonlite)
# authorization
myapp = oauth_app("YELP", key=consumerKey, secret=consumerSecret)
sig = sign_oauth1.0(myapp, token=token, token_secret=token_secret)
## Searching the top ten bars in Chicago and SF.
limit <- 10
# 10 bars in Chicago
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&location=Chicago%20IL&term=bar")
# or 10 bars by geo-coordinates
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&ll=37.788022,-122.399797&term=bar")
locationdata=GET(yelpurl, sig)
locationdataContent = content(locationdata)
locationdataList=jsonlite::fromJSON(toJSON(locationdataContent))
head(data.frame(locationdataList))
for (j in 1:limit) {
print(locationdataContent$businesses[[j]]$snippet_text)
}
Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”
The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/
Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.
Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as “byte” or “hyperlink”.
Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.
Medical dictionary, see http://www.hyperdictionary.com/medical.
Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.
Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.
Value dictionaries deal with values and may be useful when only affect (positive or negative) is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well being.
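The internet-lingo expansion mentioned above can be implemented as a simple named-vector lookup applied before dictionary-based scoring. A hedged sketch; the two-entry mini-dictionary is invented for illustration:

```r
# Expand lingo terms to their plain-English equivalents via a lookup
# table; words not in the table pass through unchanged.
lingo = c("lol" = "laughing out loud", "2bz4uqt" = "too busy for you cutey")
words = c("lol", "great", "2bz4uqt")
expanded = ifelse(words %in% names(lingo), lingo[words], words)
print(unname(expanded))
```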
A lexicon is defined by Webster’s as “a book containing an alphabetical arrangement of the words in a language and their definitions; the vocabulary of a language, an individual speaker or group of speakers, or a subject; the total stock of morphemes in a language.” This suggests it is not that different from a dictionary.
A “morpheme” is defined as “a word or a part of a word that has a meaning and that contains no smaller part that has a meaning.”
In the text analytics realm, we will take a lexicon to be a smaller, special purpose dictionary, containing words that are relevant to the domain of interest.
The benefit of a lexicon is that it enables focusing only on words that are relevant to the analytics and discards words that are not.
Another benefit is that since it is a smaller dictionary, the computational effort required by text analytics algorithms is drastically reduced.
By hand. This is an effective technique and the simplest. It calls for a human reader who scans a representative sample of text documents and culls important words that lend interpretive meaning.
Examine the term document matrix for most frequent words, and pick the ones that have high connotation for the classification task at hand.
Use pre-classified documents in a text corpus. We analyze the separate groups of documents to find words whose difference in frequency between groups is highest. Such words are likely to be better in discriminating between groups.
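The third approach can be sketched in a few lines of base R. The two document groups below are invented toy data; words with a large relative-frequency gap between groups are candidates for the lexicon:

```r
# Toy pre-classified documents, one group per sentiment class.
pos_docs = c("good strong buy rally", "strong earnings good outlook")
neg_docs = c("weak sell losses", "losses mount weak guidance")
freq = function(docs) table(unlist(strsplit(docs, " ")))
fp = freq(pos_docs)
fn = freq(neg_docs)
words = union(names(fp), names(fn))
# Relative frequency in the positive group minus that in the negative
# group; large absolute values flag good discriminators.
d = sapply(words, function(w) {
  p = if (w %in% names(fp)) fp[[w]] / sum(fp) else 0
  n = if (w %in% names(fn)) fn[[w]] / sum(fn) else 0
  p - n
})
print(round(sort(d), 2))
```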
Das and Chen (2007) constructed a lexicon of about 375 words that are useful in parsing sentiment from stock message boards. This lexicon also introduced the notion of “negation tagging” into the literature.
Text can be scored using dictionaries and word lists. Here is an example of mood scoring. We use a psychological dictionary from Harvard. There is also WordNet.
WordNet is a large database of words in English, i.e., a lexicon. The repository is at http://wordnet.princeton.edu. WordNet groups words together based on their meanings (synonyms) and hence may be used as a thesaurus. WordNet is also useful for natural language processing as it provides word lists by language category, such as noun, verb, adjective, etc.
#MOOD SCORING USING HARVARD INQUIRER
#Read in the Harvard Inquirer Dictionary
#And create a list of positive and negative words
HIDict = readLines("data_files/inqdict.txt")
dict_pos = HIDict[grep("Pos",HIDict)]
poswords = NULL
for (s in dict_pos) {
s = strsplit(s,"#")[[1]][1]
poswords = c(poswords,strsplit(s," ")[[1]][1])
}
dict_neg = HIDict[grep("Neg",HIDict)]
negwords = NULL
for (s in dict_neg) {
s = strsplit(s,"#")[[1]][1]
negwords = c(negwords,strsplit(s," ")[[1]][1])
}
poswords = tolower(poswords)
negwords = tolower(negwords)
print(sample(poswords,25))## [1] "athletic" "matchless" "woo" "clear"
## [5] "defend" "unbroken" "truth" "abundance"
## [9] "tactics" "joke" "safe" "generate"
## [13] "considerate" "plain" "self-respect" "staunchness"
## [17] "allow" "glee" "astound" "sparkle"
## [21] "standardize" "sympathetic" "brainy" "fair"
## [25] "advance"
print(sample(negwords,25))## [1] "frighten" "regression" "recklessness" "expose"
## [5] "berserk" "competitor" "corrode" "unimpeachable"
## [9] "paralysis" "malicious" "thorny" "battle"
## [13] "peculiar" "defect" "depress" "unforgettable"
## [17] "shock" "pound" "disapproval" "mind"
## [21] "sketchy" "beastly" "unseen" "wrought"
## [25] "temptation"
poswords = unique(poswords)
negwords = unique(negwords)
print(length(poswords))## [1] 1647
print(length(negwords))## [1] 2121
The preceding code created two arrays, one of positive words and another of negative words.
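With the two arrays in hand, a document's mood score is simply the count of positive matches minus the count of negative matches. A minimal sketch, using tiny stand-in word lists so that it is self-contained (in practice, use the poswords and negwords arrays built above):

```r
# Stand-in word lists drawn from the Harvard Inquirer samples above.
poswords = c("good", "glee", "advance")
negwords = c("shock", "defect", "depress")
txt = "the advance was good but the defect came as a shock"
words = strsplit(tolower(txt), " ")[[1]]
score = sum(words %in% poswords) - sum(words %in% negwords)
print(score)  # 2 positive hits minus 2 negative hits = 0
```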
You can also directly use the EmoLex, which already contains positive and negative words; see the NRC Word-Emotion Lexicon: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
In order to score text, we need to clean it first and put it into an array to compare with the word list of positive and negative words. I wrote a general purpose function that grabs text and cleans it up for further use.
library(tm)
library(stringr)
#READ IN TEXT FOR ANALYSIS, PUT IT IN A CORPUS, OR ARRAY, OR FLAT STRING
#cstem=1, if stemming needed
#cstop=1, if stopwords to be removed
#ccase=1 for lower case, ccase=2 for upper case
#cpunc=1, if punctuation to be removed
#cflat=1 for flat text wanted, cflat=2 if text array, else returns corpus
read_web_page = function(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=0) {
text = readLines(url)
text = text[setdiff(seq(1,length(text)),grep("<",text))]
text = text[setdiff(seq(1,length(text)),grep(">",text))]
text = text[setdiff(seq(1,length(text)),grep("]",text))]
text = text[setdiff(seq(1,length(text)),grep("}",text))]
text = text[setdiff(seq(1,length(text)),grep("_",text))]
text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
ctext = Corpus(VectorSource(text))
if (cstem==1) { ctext = tm_map(ctext, stemDocument) }
if (cstop==1) { ctext = tm_map(ctext, removeWords, stopwords("english"))}
if (cpunc==1) { ctext = tm_map(ctext, removePunctuation) }
if (ccase==1) { ctext = tm_map(ctext, tolower) }
if (ccase==2) { ctext = tm_map(ctext, toupper) }
text = ctext
#CONVERT FROM CORPUS IF NEEDED
if (cflat>0) {
text = NULL
for (j in 1:length(ctext)) {
temp = ctext[[j]]$content
if (temp!="") { text = c(text,temp) }
}
text = as.array(text)
}
if (cflat==1) {
text = paste(text,collapse="\n")
text = str_replace_all(text, "[\r\n]" , " ")
}
result = text
}
Now apply this function and see how we can get some clean text.
url = "http://srdas.github.io/research.htm"
res = read_web_page(url,0,0,0,1,1)
print(res)## [1] "Data Science Theories Models Algorithms and Analytics web book work in progress Derivatives Principles and Practice 2010 Rangarajan Sundaram and Sanjiv Das McGraw Hill An IndexBased Measure of Liquidity with George Chacko and Rong Fan 2016 Matrix Metrics NetworkBased Systemic Risk Scoring 2016 of systemic risk This paper won the First Prize in the MITCFP competition 2016 for the best paper on SIFIs systemically important financial institutions It also won the best paper award at Credit Spreads with Dynamic Debt with Seoyoung Kim 2015 Text and Context Language Analytics for Finance 2014 Strategic Loan Modification An OptionsBased Response to Strategic Default Options and Structured Products in Behavioral Portfolios with Meir Statman 2013 and barrier range notes in the presence of fattailed outcomes using copulas Polishing Diamonds in the Rough The Sources of Syndicated Venture Performance 2011 with Hoje Jo and Yongtae Kim Optimization with Mental Accounts 2010 with Harry Markowitz Jonathan Accountingbased versus marketbased crosssectional models of CDS spreads with Paul Hanouna and Atulya Sarin 2009 Hedging Credit Equity Liquidity Matters with Paul Hanouna 2009 An Integrated Model for Hybrid Securities Yahoo for Amazon Sentiment Extraction from Small Talk on the Web Common Failings How Corporate Defaults are Correlated with Darrell Duffie Nikunj Kapadia and Leandro Saita A Clinical Study of Investor Discussion and Sentiment with Asis MartinezJerez and Peter Tufano 2005 International Portfolio Choice with Systemic Risk The loss resulting from diminished diversification is small while Speech Signaling Risksharing and the Impact of Fee Structures on investor welfare Contrary to regulatory intuition incentive structures A DiscreteTime Approach to Noarbitrage Pricing of Credit derivatives with Rating Transitions with Viral Acharya and Rangarajan Sundaram Pricing Interest Rate Derivatives A General Approachwith George Chacko A DiscreteTime Approach to 
ArbitrageFree Pricing of Credit Derivatives The Psychology of Financial Decision Making A Case for TheoryDriven Experimental Enquiry 1999 with Priya Raghubir Of Smiles and Smirks A Term Structure Perspective A Theory of Banking Structure 1999 with Ashish Nanda by function based upon two dimensions the degree of information asymmetry A Theory of Optimal Timing and Selectivity A Direct DiscreteTime Approach to PoissonGaussian Bond Option Pricing in the HeathJarrowMorton The Central Tendency A Second Factor in Bond Yields 1998 with Silverio Foresi and Pierluigi Balduzzi Efficiency with Costly Information A Reinterpretation of Evidence from Managed Portfolios with Edwin Elton Martin Gruber and Matt Presented and Reprinted in the Proceedings of The Seminar on the Analysis of Security Prices at the Center for Research in Security Prices at the University of Managing Rollover Risk with Capital Structure Covenants in Structured Finance Vehicles 2016 The Design and Risk Management of Structured Finance Vehicles 2016 Post the recent subprime financial crisis we inform the creation of safer SIVs in structured finance and propose avenues of mitigating risks faced by senior debt through Coming up Short Managing Underfunded Portfolios in an LDIES Framework 2014 with Seoyoung Kim and Meir Statman Going for Broke Restructuring Distressed Debt Portfolios 2014 Digital Portfolios 2013 Options on Portfolios with HigherOrder Moments 2009 options on a multivariate system of assets calibrated to the return Dealing with Dimension Option Pricing on Factor Trees 2009 you to price options on multiple assets in a unified fraamework Computational Modeling Correlated Default with a Forest of Binomial Trees 2007 with Basel II Correlation Related Issues 2007 Correlated Default Risk 2006 with Laurence Freed Gary Geng and Nikunj Kapadia increase as markets worsen Regime switching models are needed to explain dynamic A Simple Model for Pricing Equity Options with Markov Switching State Variables 
2006 with Donald Aingworth and Rajeev Motwani The Firms Management of Social Interactions 2005 with D Godes D Mayzlin Y Chen S Das C Dellarocas B Pfeieffer B Libai S Sen M Shi and P Verlegh Financial Communities with Jacob Sisk 2005 Summer 112123 Monte Carlo Markov Chain Methods for Derivative Pricing and Risk Assessmentwith Alistair Sinclair 2005 where incomplete information about the value of an asset may be exploited to undertake fast and accurate pricing Proof that a fully polynomial randomized Correlated Default Processes A CriterionBased Copula Approach Special Issue on Default Risk Private Equity Returns An Empirical Examination of the Exit of VentureBacked Companies with Murali Jagannathan and Atulya Sarin firm being financed the valuation at the time of financing and the prevailing market sentiment Helps understand the risk premium required for the Issue on Computational Methods in Economics and Finance December 5569 Bayesian Migration in Credit Ratings Based on Probabilities of The Impact of Correlated Default Risk on Credit Portfolios with Gifford Fong and Gary Geng How Diversified are Internationally Diversified Portfolios TimeVariation in the Covariances between International Returns DiscreteTime Bond and Option Pricing for JumpDiffusion Macroeconomic Implications of Search Theory for the Labor Market Auction Theory A Summary with Applications and Evidence from the Treasury Markets 1996 with Rangarajan Sundaram A Simple Approach to Three Factor Affine Models of the Term Structure with Pierluigi Balduzzi Silverio Foresi and Rangarajan Analytical Approximations of the Term Structure for Jumpdiffusion Processes A Numerical Analysis 1996 Markov Chain Term Structure Models Extensions and Applications Exact Solutions for Bond and Options Prices with Systematic Jump Risk 1996 with Silverio Foresi Pricing Credit Sensitive Debt when Interest Rates Credit Ratings and Credit Spreads are Stochastic 1996 v52 161198 Did CDS Trading Improve the Market for Corporate 
Bonds 2016 with Madhu Kalimipalli and Subhankar Nayak Big Datas Big Muscle 2016 Portfolios for Investors Who Want to Reach Their Goals While Staying on the MeanVariance Efficient Frontier 2011 with Harry Markowitz Jonathan Scheid and Meir Statman News Analytics Framework Techniques and Metrics The Handbook of News Analytics in Finance May 2011 John Wiley Sons UK Random Lattices for Option Pricing Problems in Finance 2011 Implementing Option Pricing Models using Python and Cython 2010 The Finance Web Internet Information and Markets 2010 Financial Applications with Parallel R 2009 Recovery Swaps 2009 with Paul Hanouna Recovery Rates 2009with Paul Hanouna A Simple Model for Pricing Securities with a DebtEquity Linkage 2008 in Credit Default Swap Spreads 2006 with Paul Hanouna MultipleCore Processors for Finance Applications 2006 Power Laws 2005 with Jacob Sisk Genetic Algorithms 2005 Recovery Risk 2005 Venture Capital Syndication with Hoje Jo and Yongtae Kim 2004 Technical Analysis with David Tien 2004 Liquidity and the Bond Markets with Jan Ericsson and Madhu Kalimipalli 2003 Modern Pricing of Interest Rate Derivatives Book Review Contagion 2003 Hedge Funds 2003 Reprinted in Working Papers on Hedge Funds in The World of Hedge Funds Characteristics and Analysis 2005 World Scientific The Internet and Investors 2003 Useful things to know about Correlated Default Risk with Gifford Fong Laurence Freed Gary Geng and Nikunj Kapadia The Regulation of Fee Structures in Mutual Funds A Theoretical Analysis with Rangarajan Sundaram 1998 NBER WP No 6639 in the Courant Institute of Mathematical Sciences special volume on A DiscreteTime Approach to ArbitrageFree Pricing of Credit Derivatives with Rangarajan Sundaram reprinted in the Courant Institute of Mathematical Sciences special volume on Stochastic Mean Models of the Term Structure with Pierluigi Balduzzi Silverio Foresi and Rangarajan Sundaram John Wiley Sons Inc 128161 Interest Rate Modeling with JumpDiffusion Processes 
John Wiley Sons Inc 162189 Comments on Pricing ExcessofLoss Reinsurance Contracts against Catastrophic Loss by J David Cummins C Lewis and Richard Phillips Froot Ed University of Chicago Press 1999 141145 Pricing Credit Derivatives J Frost and JG Whittaker 101138 On the Recursive Implementation of Term Structure Models Local Volatility and the Recovery Rate of Credit Default Swaps with Jeroen Jansen and Frank Fabozzi Efficient Rebalancing of Taxable Portfolios with Dan Ostrov Dennis Ding Vincent Newell The Fast and the Curious VC Drift with Amit Bubna and Paul Hanouna Venture Capital Communities with Amit Bubna and Nagpurnanand Prabhala "
Now we will take a different page of text and mood score it.
#EXAMPLE OF MOOD SCORING
library(stringr)
url = "http://srdas.github.io/bio-candid.html"
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=1,cflat=1)
print(text)## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara Universitys Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley He holds postgraduate degrees in Finance MPhil and PhD from New York University Computer Science MS from UC Berkeley an MBA from the Indian Institute of Management Ahmedabad BCom in Accounting and Economics University of Bombay Sydenham College and is also a qualified Cost and Works Accountant He is a senior editor of The Journal of Investment Management coeditor of The Journal of Derivatives and The Journal of Financial Services Research and Associate Editor of other academic journals Prior to being an academic he worked in the derivatives business in the AsiaPacific region as a VicePresident at Citibank His current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital He has published over ninety articles in academic journals and has won numerous awards for research and teaching His recent book Derivatives Principles and Practice was published in May 2010 He currently also serves as a Senior Fellow at the FDIC Center for Financial Research After loafing and working in many parts of Asia but never really growing up Sanjiv moved to New York to change the world hopefully through research He graduated in 1994 with a PhD from NYU and since then spent five years in Boston and now lives in San Jose California Sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code When there is time available from the excitement of daily life Sanjiv writes academic papers which helps him relax Always the contrarian Sanjiv thinks that New York City is the most calming place in the world after California of course Sanjiv is now a Professor of 
Finance at Santa Clara University He came to SCU from Harvard Business School and spent a year at UC Berkeley In his past life in the unreal world Sanjiv worked at Citibank NA in the AsiaPacific region He takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse Sanjivs research style is instilled with a distinct New York state of mind it is chaotic diverse with minimal method to the madness He has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas Some years ago he took time off to get another degree in computer science at Berkeley confirming that an unchecked hobby can quickly become an obsession There he learnt about the fascinating field of Randomized Algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of Silicon Valley Coastal living did a lot to mold Sanjiv who needs to live near the ocean The many walks in Greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function He learnt that it is important to open the academic door to the ivory tower and let the world in Academia is a real challenge given that he has to reconcile many more opinions than ideas He has been known to have turned down many offers from Mad magazine to publish his academic work As he often explains you never really finish your education you can check out any time you like but you can never leave Which is why he is doomed to a lifetime in Hotel California And he believes that if this is as bad as it gets life is really pretty good"
library(stringr)
text = str_replace_all(text,"nbsp"," ")
text## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara Universitys Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley ..."
text = unlist(strsplit(text," "))
print(text)## [1] "Sanjiv" "Das" "is" "the"
## [5] "William" "and" "Janice" "Terry"
## [9] "Professor" "of" "Finance" "at"
## [13] "Santa" "Clara" "Universitys" "Leavey"
## [17] "School" "of" "Business" "He"
## [21] "previously" "held" "faculty" "appointments"
## [25] "as" "Associate" "Professor" "at"
## [29] "Harvard" "Business" "School" "and"
## [33] "UC" "Berkeley" "He" "holds"
## [37] "postgraduate" "degrees" "in" "Finance"
## [41] "MPhil" "and" "PhD" "from"
## [45] "New" "York" "University" "Computer"
## [49] "Science" "MS" "from" "UC"
## [53] "Berkeley" "an" "MBA" "from"
## [57] "the" "Indian" "Institute" "of"
## [61] "Management" "Ahmedabad" "BCom" "in"
## [65] "Accounting" "and" "Economics" "University"
## [69] "of" "Bombay" "Sydenham" "College"
## [73] "and" "is" "also" "a"
## [77] "qualified" "Cost" "and" "Works"
## [81] "Accountant" "He" "is" "a"
## [85] "senior" "editor" "of" "The"
## [89] "Journal" "of" "Investment" "Management"
## [93] "coeditor" "of" "The" "Journal"
## [97] "of" "Derivatives" "and" "The"
## [101] "Journal" "of" "Financial" "Services"
## [105] "Research" "and" "Associate" "Editor"
## [109] "of" "other" "academic" "journals"
## [113] "Prior" "to" "being" "an"
## [117] "academic" "he" "worked" "in"
## [121] "the" "derivatives" "business" "in"
## [125] "the" "AsiaPacific" "region" "as"
## [129] "a" "VicePresident" "at" "Citibank"
## [133] "His" "current" "research" "interests"
## [137] "include" "the" "modeling" "of"
## [141] "default" "risk" "machine" "learning"
## [145] "social" "networks" "derivatives" "pricing"
## [149] "models" "portfolio" "theory" "and"
## [153] "venture" "capital" "He" "has"
## [157] "published" "over" "ninety" "articles"
## [161] "in" "academic" "journals" "and"
## [165] "has" "won" "numerous" "awards"
## [169] "for" "research" "and" "teaching"
## [173] "His" "recent" "book" "Derivatives"
## [177] "Principles" "and" "Practice" "was"
## [181] "published" "in" "May" "2010"
## [185] "" "He" "currently" "also"
## [189] "serves" "as" "a" "Senior"
## [193] "Fellow" "at" "the" "FDIC"
## [197] "Center" "for" "Financial" "Research"
## [201] "After" "loafing" "and" "working"
## [205] "in" "many" "parts" "of"
## [209] "Asia" "but" "never" "really"
## [213] "growing" "up" "Sanjiv" "moved"
## [217] "to" "New" "York" "to"
## [221] "change" "the" "world" "hopefully"
## [225] "through" "research" "" "He"
## [229] "graduated" "in" "1994" "with"
## [233] "a" "PhD" "from" "NYU"
## [237] "and" "since" "then" "spent"
## [241] "five" "years" "in" "Boston"
## [245] "and" "now" "lives" "in"
## [249] "San" "Jose" "California" ""
## [253] "Sanjiv" "loves" "animals" "places"
## [257] "in" "the" "world" "where"
## [261] "the" "mountains" "meet" "the"
## [265] "sea" "riding" "sport" "motorbikes"
## [269] "reading" "gadgets" "science" "fiction"
## [273] "movies" "and" "writing" "cool"
## [277] "software" "code" "When" "there"
## [281] "is" "time" "available" "from"
## [285] "the" "excitement" "of" "daily"
## [289] "life" "Sanjiv" "writes" "academic"
## [293] "papers" "which" "helps" "him"
## [297] "relax" "Always" "the" "contrarian"
## [301] "Sanjiv" "thinks" "that" "New"
## [305] "York" "City" "is" "the"
## [309] "most" "calming" "place" "in"
## [313] "the" "world" "after" "California"
## [317] "of" "course" "Sanjiv" "is"
## [321] "now" "a" "Professor" "of"
## [325] "Finance" "at" "Santa" "Clara"
## [329] "University" "He" "came" "to"
## [333] "SCU" "from" "Harvard" "Business"
## [337] "School" "and" "spent" "a"
## [341] "year" "at" "UC" "Berkeley"
## [345] "In" "his" "past" "life"
## [349] "in" "the" "unreal" "world"
## [353] "Sanjiv" "worked" "at" "Citibank"
## [357] "NA" "in" "the" "AsiaPacific"
## [361] "region" "He" "takes" "great"
## [365] "pleasure" "in" "merging" "his"
## [369] "many" "previous" "lives" "into"
## [373] "his" "current" "existence" "which"
## [377] "is" "incredibly" "confused" "and"
## [381] "diverse" "Sanjivs" "research" "style"
## [385] "is" "instilled" "with" "a"
## [389] "distinct" "New" "York" "state"
## [393] "of" "mind" "" "it"
## [397] "is" "chaotic" "diverse" "with"
## [401] "minimal" "method" "to" "the"
## [405] "madness" "He" "has" "published"
## [409] "articles" "on" "derivatives" "termstructure"
## [413] "models" "mutual" "funds" "the"
## [417] "internet" "portfolio" "choice" "banking"
## [421] "models" "credit" "risk" "and"
## [425] "has" "unpublished" "articles" "in"
## [429] "many" "other" "areas" "Some"
## [433] "years" "ago" "he" "took"
## [437] "time" "off" "to" "get"
## [441] "another" "degree" "in" "computer"
## [445] "science" "at" "Berkeley" "confirming"
## [449] "that" "an" "unchecked" "hobby"
## [453] "can" "quickly" "become" "an"
## [457] "obsession" "There" "he" "learnt"
## [461] "about" "the" "fascinating" "field"
## [465] "of" "Randomized" "Algorithms" "skills"
## [469] "he" "now" "applies" "earnestly"
## [473] "to" "his" "editorial" "work"
## [477] "and" "other" "pursuits" "many"
## [481] "of" "which" "stem" "from"
## [485] "being" "in" "the" "epicenter"
## [489] "of" "Silicon" "Valley" "Coastal"
## [493] "living" "did" "a" "lot"
## [497] "to" "mold" "Sanjiv" "who"
## [501] "needs" "to" "live" "near"
## [505] "the" "ocean" "" "The"
## [509] "many" "walks" "in" "Greenwich"
## [513] "village" "convinced" "him" "that"
## [517] "there" "is" "no" "such"
## [521] "thing" "as" "a" "representative"
## [525] "investor" "yet" "added" "many"
## [529] "unique" "features" "to" "his"
## [533] "personal" "utility" "function" "He"
## [537] "learnt" "that" "it" "is"
## [541] "important" "to" "open" "the"
## [545] "academic" "door" "to" "the"
## [549] "ivory" "tower" "and" "let"
## [553] "the" "world" "in" "Academia"
## [557] "is" "a" "real" "challenge"
## [561] "given" "that" "he" "has"
## [565] "to" "reconcile" "many" "more"
## [569] "opinions" "than" "ideas" "He"
## [573] "has" "been" "known" "to"
## [577] "have" "turned" "down" "many"
## [581] "offers" "from" "Mad" "magazine"
## [585] "to" "publish" "his" "academic"
## [589] "work" "As" "he" "often"
## [593] "explains" "you" "never" "really"
## [597] "finish" "your" "education" ""
## [601] "you" "can" "check" "out"
## [605] "any" "time" "you" "like"
## [609] "but" "you" "can" "never"
## [613] "leave" "Which" "is" "why"
## [617] "he" "is" "doomed" "to"
## [621] "a" "lifetime" "in" "Hotel"
## [625] "California" "And" "he" "believes"
## [629] "that" "if" "this" "is"
## [633] "as" "bad" "as" "it"
## [637] "gets" "life" "is" "really"
## [641] "pretty" "good"
#COUNT MATCHES AGAINST POSITIVE AND NEGATIVE WORD LISTS
posmatch = match(text,poswords)
numposmatch = length(posmatch[which(posmatch>0)])
negmatch = match(text,negwords)
numnegmatch = length(negmatch[which(negmatch>0)])
print(c(numposmatch,numnegmatch))## [1] 26 16
#FURTHER EXPLORATION OF THESE OBJECTS
print(length(text))## [1] 642
print(posmatch)## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [15] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [29] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [43] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [57] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [71] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [85] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [99] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [113] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [127] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [141] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [155] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [169] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [183] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [197] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [211] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [225] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [239] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [253] NA NA NA NA NA NA NA NA NA NA 994 NA NA NA
## [267] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [281] NA NA NA NA NA 611 NA NA NA NA NA NA NA NA
## [295] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [309] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [323] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [337] NA NA NA NA NA NA NA NA NA 800 NA NA NA NA
## [351] NA NA NA NA NA NA NA NA NA NA NA NA NA 761
## [365] 1144 NA NA 800 NA NA NA NA 800 NA NA NA NA NA
## [379] NA NA NA NA NA NA NA NA NA NA 515 NA NA NA
## [393] NA 1011 NA NA NA NA NA NA NA NA NA NA NA NA
## [407] NA NA NA NA NA NA NA 1036 NA NA NA NA NA NA
## [421] NA 455 NA NA NA NA NA NA NA NA NA NA NA NA
## [435] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [449] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [463] NA NA NA NA NA NA NA NA NA NA NA 800 NA NA
## [477] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [491] NA NA NA NA NA NA NA NA NA NA NA NA 941 NA
## [505] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [519] NA NA NA NA NA NA NA NA NA NA 1571 NA NA 800
## [533] NA NA NA NA NA NA NA NA 838 NA 1076 NA NA NA
## [547] NA NA NA NA NA NA NA NA NA NA NA NA 1255 NA
## [561] NA NA NA NA NA 1266 NA NA NA NA NA NA NA NA
## [575] NA NA 781 NA NA NA NA NA NA NA NA NA 800 NA
## [589] NA NA NA NA NA NA NA NA NA 1645 542 NA NA NA
## [603] NA NA NA NA NA 940 NA NA NA NA NA NA NA NA
## [617] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [631] NA NA NA NA NA NA NA NA NA NA 1184 747
print(text[77])## [1] "qualified"
print(poswords[204])## [1] "back"
is.na(posmatch)## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [144] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [177] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [188] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [199] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [210] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [221] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [232] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [243] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [254] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [265] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [276] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [287] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [298] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [309] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [320] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [342] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [353] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [364] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [375] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [386] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [397] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [408] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [419] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [430] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [441] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [452] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [463] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [474] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [485] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [496] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [507] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [518] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [529] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [540] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [551] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [562] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [573] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [584] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [595] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [606] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [617] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [628] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [639] TRUE TRUE FALSE FALSE
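The two counts can be combined into a single net-sentiment score. A minimal sketch, using tiny stand-in lexicons in place of the full poswords and negwords lists loaded earlier:

```r
# Hypothetical small lexicons stand in for the full poswords/negwords lists
poswords = c("good", "great", "qualified", "won", "pleasure")
negwords = c("bad", "risk", "confused", "doomed", "madness")
text = c("life", "is", "really", "pretty", "good", "despite", "risk")

# Count words matched against each lexicon
numpos = length(which(!is.na(match(text, poswords))))
numneg = length(which(!is.na(match(text, negwords))))

# Net sentiment, normalized by the total number of matched words
sentiment = (numpos - numneg) / (numpos + numneg)
print(sentiment)
```

A score near +1 indicates predominantly positive text, near -1 predominantly negative, and 0 a balance of the two.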
We may scrape web sites from many countries, and so need to detect the language and then translate the text into English for mood scoring. The textcat package lets us categorize the language.
library(textcat)
text = c("Je suis un programmeur novice.",
"I am a programmer who is a novice.",
"Sono un programmatore alle prime armi.",
"Ich bin ein Anfänger Programmierer",
"Soy un programador con errores.")
lang = textcat(text)
print(lang)## [1] "french" "english" "italian" "german" "spanish"
And of course, once the language is detected, we may translate it into English.
library(translate)
set.key("YOUR_GOOGLE_API_KEY")  #Replace with your own Google Translate API key
print(translate(text[1],"fr","en"))## list()
print(translate(text[3],"it","en"))## list()
print(translate(text[4],"de","en"))## list()
print(translate(text[5],"es","en"))## list()
This requires a Google API key, for which you need to set up a paid account; without a valid key the calls return empty results, as seen in the list() output above.
Machine classification is, from a layman’s point of view, simply learning by example. In modern parlance, it is a technique from the field of “machine learning”.
Learning by machines falls into two categories, supervised and unsupervised. When a number of explanatory \(X\) variables are used to determine some outcome \(Y\), and we train an algorithm to do this, we are performing supervised (machine) learning. The outcome \(Y\) may be a dependent variable (for example, the left hand side in a linear regression), or a classification (i.e., discrete outcome).
When we only have \(X\) variables and no separate outcome variable \(Y\), we perform unsupervised learning. Cluster analysis, which groups entities based on their \(X\) variables alone, is a common example.
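As a quick illustration of unsupervised learning, k-means clustering groups the iris observations using only their four feature columns; the species label plays no role in the fit and is used here only to check the result:

```r
# Unsupervised learning: k-means on the iris features (no Y variable used)
data(iris)
cl = kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Compare the discovered clusters against the (unused) species labels
print(table(cl$cluster, iris$Species))
```

The cross-tabulation typically shows one cluster capturing setosa cleanly, with some mixing between versicolor and virginica.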
We start with a simple example on numerical data before discussing how this is applied to text. We first look at the Bayes classifier.
Bayes classification extends the Document-Term model with a document-term-classification model. These are the three entities in the model and we denote them as \((d,t,c)\). Assume that there are \(D\) documents to classify into \(C\) categories, and we employ a dictionary/lexicon (as the case may be) of \(T\) terms or words. Hence we have \(d_i, i = 1, ... , D\), and \(t_j, j = 1, ... , T\). And correspondingly the categories for classification are \(c_k, k = 1, ... , C\).
Suppose we are given a text corpus of stock market related documents (tweets, for example), and wish to classify them into bullish (\(c_1\)), neutral (\(c_2\)), or bearish (\(c_3\)), where \(C=3\). We first need to train the Bayes classifier on a training data set of \(D\) pre-classified documents. For each term \(t\) in the lexicon, we can compute how likely it is to appear in documents of each class \(c_k\). Therefore, for each class there is a \(T\)-sided die, with each face representing a term and having some probability of coming up. These dice give the probabilities of seeing each word in each class of document, denoted succinctly as \(p(t | c)\). For example, in a bearish document, if the word “sell” comprises 10% of the words that appear, then \(p(t=\mbox{sell} | c=\mbox{bearish})=0.10\).
To ensure that a term that never appears in a class still receives a non-zero probability (Laplace smoothing), we compute the probabilities as follows:
\[ \begin{equation} p(t | c) = \frac{n(t | c) + 1}{n(c)+T} \end{equation} \]
where \(n(t | c)\) is the number of times word \(t\) appears in category \(c\), and \(n(c) = \sum_t n(t | c)\) is the total number of words in the training data in class \(c\). Note that if there are no words in the class \(c\), then each term \(t\) has probability \(1/T\).
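A quick sketch of this smoothed computation, using a hypothetical four-term lexicon and made-up counts for one class:

```r
# Laplace-smoothed term probabilities p(t|c) for one class (toy numbers)
lexicon = c("buy", "sell", "hold", "crash")            # T = 4 terms
n_tc = c(buy = 5, sell = 30, hold = 10, crash = 15)    # n(t|c): counts in class c
T = length(lexicon)
n_c = sum(n_tc)                                        # n(c) = total words in class c

p_tc = (n_tc + 1) / (n_c + T)                          # add-one smoothing
print(p_tc)
print(sum(p_tc))   # smoothed probabilities still sum to 1
```

Note that a term with zero count would receive probability \(1/(n(c)+T)\) rather than zero, which prevents a single unseen word from zeroing out an entire document's likelihood.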
A document \(d_i\) is a collection or set of words \(t_j\). The probability of seeing a given document in each category is given by the following multinomial probability:
\[ \begin{equation} p(d | c) = \frac{n(d)!}{n(t_1|d)! \cdot n(t_2|d)! \cdots n(t_T|d)!} \times p(t_1 | c)^{n(t_1|d)} \cdot p(t_2 | c)^{n(t_2|d)} \cdots p(t_T | c)^{n(t_T|d)} \nonumber \end{equation} \]
where \(n(d)\) is the number of words in the document, and \(n(t_j | d)\) is the number of occurrences of word \(t_j\) in the same document \(d\). These \(p(d | c)\) are the prior probabilities in the Bayes classifier, computed from all documents in the training data. The posterior probabilities are computed for each document in the test data as follows:
\[ \begin{equation} p(c | d) = \frac{p(d | c) p(c)}{\sum_k \; p(d | c_k) p(c_k)}, \forall k = 1, \ldots, C \nonumber \end{equation} \]
Note that we get \(C\) posterior probabilities for document \(d\), and we assign the document to the class with the highest posterior probability, i.e., class \(c_{k^*}\) where \(k^* = \arg\max_k p(c_k | d)\).
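A small numerical sketch of the posterior computation, with hypothetical priors \(p(c)\) and likelihoods \(p(d|c)\) for the three stock market classes:

```r
# Bayes posterior p(c|d) from priors and likelihoods (toy numbers)
p_c   = c(bullish = 0.4, neutral = 0.3, bearish = 0.3)        # class priors p(c)
p_d_c = c(bullish = 0.002, neutral = 0.001, bearish = 0.004)  # likelihoods p(d|c)

# Bayes rule: normalize p(d|c) * p(c) over all classes
post = p_d_c * p_c / sum(p_d_c * p_c)
print(post)
print(names(which.max(post)))   # the assigned class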
We use the e1071 package. Its naiveBayes() function takes the tagged training dataset in a one-line call and returns the trained classifier model.
The trained classifier contains the unconditional probabilities \(p(c)\) of each class, which are simply the frequencies with which each class appears in the training data. It also shows the conditional probability distributions \(p(t | c)\), given as the mean and standard deviation of the occurrence of these terms in each class. We may take this trained model and re-apply it to the training data set to see how well it does, using the predict() function. The data set here is the classic Iris data.
For text mining, the feature set in the data will be the set of all words, with one column per word, so the feature set will be large. To keep it small, we may instead restrict the features to the words of a lexicon. This vastly reduces the feature set and makes it more specific.
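A minimal sketch of this idea, using a tiny hypothetical lexicon and two toy documents to build a reduced document-term matrix in base R:

```r
# Restrict document-term features to lexicon words only (toy example)
lexicon = c("risk", "good", "bad", "default")   # hypothetical lexicon
docs = list(c("default", "risk", "is", "bad"),
            c("a", "good", "year", "for", "banks"))

# One row per document, one column per lexicon word;
# words outside the lexicon are simply dropped by factor()
dtm = t(sapply(docs, function(d) table(factor(d, levels = lexicon))))
print(dtm)
```

The resulting matrix has only \(4\) columns regardless of vocabulary size, and each row can then be fed to a classifier such as naiveBayes().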
library(e1071)
data(iris)
print(head(iris))## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
#NAIVE BAYES
res = naiveBayes(iris[,1:4],iris[,5])
#SHOWS THE PRIOR AND LIKELIHOOD FUNCTIONS
res##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = iris[, 1:4], y = iris[, 5])
##
## A-priori probabilities:
## iris[, 5]
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Conditional probabilities:
## Sepal.Length
## iris[, 5] [,1] [,2]
## setosa 5.006 0.3524897
## versicolor 5.936 0.5161711
## virginica 6.588 0.6358796
##
## Sepal.Width
## iris[, 5] [,1] [,2]
## setosa 3.428 0.3790644
## versicolor 2.770 0.3137983
## virginica 2.974 0.3224966
##
## Petal.Length
## iris[, 5] [,1] [,2]
## setosa 1.462 0.1736640
## versicolor 4.260 0.4699110
## virginica 5.552 0.5518947
##
## Petal.Width
## iris[, 5] [,1] [,2]
## setosa 0.246 0.1053856
## versicolor 1.326 0.1977527
## virginica 2.026 0.2746501
#SHOWS POSTERIOR PROBABILITIES
predict(res,iris[,1:4],type="raw")## setosa versicolor virginica
## [1,] 1.000000e+00 2.981309e-18 2.152373e-25
## [2,] 1.000000e+00 3.169312e-17 6.938030e-25
## [3,] 1.000000e+00 2.367113e-18 7.240956e-26
## [4,] 1.000000e+00 3.069606e-17 8.690636e-25
## [5,] 1.000000e+00 1.017337e-18 8.885794e-26
## [6,] 1.000000e+00 2.717732e-14 4.344285e-21
## [7,] 1.000000e+00 2.321639e-17 7.988271e-25
## [8,] 1.000000e+00 1.390751e-17 8.166995e-25
## [9,] 1.000000e+00 1.990156e-17 3.606469e-25
## [10,] 1.000000e+00 7.378931e-18 3.615492e-25
## [11,] 1.000000e+00 9.396089e-18 1.474623e-24
## [12,] 1.000000e+00 3.461964e-17 2.093627e-24
## [13,] 1.000000e+00 2.804520e-18 1.010192e-25
## [14,] 1.000000e+00 1.799033e-19 6.060578e-27
## [15,] 1.000000e+00 5.533879e-19 2.485033e-25
## [16,] 1.000000e+00 6.273863e-17 4.509864e-23
## [17,] 1.000000e+00 1.106658e-16 1.282419e-23
## [18,] 1.000000e+00 4.841773e-17 2.350011e-24
## [19,] 1.000000e+00 1.126175e-14 2.567180e-21
## [20,] 1.000000e+00 1.808513e-17 1.963924e-24
## [21,] 1.000000e+00 2.178382e-15 2.013989e-22
## [22,] 1.000000e+00 1.210057e-15 7.788592e-23
## [23,] 1.000000e+00 4.535220e-20 3.130074e-27
## [24,] 1.000000e+00 3.147327e-11 8.175305e-19
## [25,] 1.000000e+00 1.838507e-14 1.553757e-21
## [26,] 1.000000e+00 6.873990e-16 1.830374e-23
## [27,] 1.000000e+00 3.192598e-14 1.045146e-21
## [28,] 1.000000e+00 1.542562e-17 1.274394e-24
## [29,] 1.000000e+00 8.833285e-18 5.368077e-25
## [30,] 1.000000e+00 9.557935e-17 3.652571e-24
## [31,] 1.000000e+00 2.166837e-16 6.730536e-24
## [32,] 1.000000e+00 3.940500e-14 1.546678e-21
## [33,] 1.000000e+00 1.609092e-20 1.013278e-26
## [34,] 1.000000e+00 7.222217e-20 4.261853e-26
## [35,] 1.000000e+00 6.289348e-17 1.831694e-24
## [36,] 1.000000e+00 2.850926e-18 8.874002e-26
## [37,] 1.000000e+00 7.746279e-18 7.235628e-25
## [38,] 1.000000e+00 8.623934e-20 1.223633e-26
## [39,] 1.000000e+00 4.612936e-18 9.655450e-26
## [40,] 1.000000e+00 2.009325e-17 1.237755e-24
## [41,] 1.000000e+00 1.300634e-17 5.657689e-25
## [42,] 1.000000e+00 1.577617e-15 5.717219e-24
## [43,] 1.000000e+00 1.494911e-18 4.800333e-26
## [44,] 1.000000e+00 1.076475e-10 3.721344e-18
## [45,] 1.000000e+00 1.357569e-12 1.708326e-19
## [46,] 1.000000e+00 3.882113e-16 5.587814e-24
## [47,] 1.000000e+00 5.086735e-18 8.960156e-25
## [48,] 1.000000e+00 5.012793e-18 1.636566e-25
## [49,] 1.000000e+00 5.717245e-18 8.231337e-25
## [50,] 1.000000e+00 7.713456e-18 3.349997e-25
## [51,] 4.893048e-107 8.018653e-01 1.981347e-01
## [52,] 7.920550e-100 9.429283e-01 5.707168e-02
## [53,] 5.494369e-121 4.606254e-01 5.393746e-01
## [54,] 1.129435e-69 9.999621e-01 3.789964e-05
## [55,] 1.473329e-105 9.503408e-01 4.965916e-02
## [56,] 1.931184e-89 9.990013e-01 9.986538e-04
## [57,] 4.539099e-113 6.592515e-01 3.407485e-01
## [58,] 2.549753e-34 9.999997e-01 3.119517e-07
## [59,] 6.562814e-97 9.895385e-01 1.046153e-02
## [60,] 5.000210e-69 9.998928e-01 1.071638e-04
## [61,] 7.354548e-41 9.999997e-01 3.143915e-07
## [62,] 4.799134e-86 9.958564e-01 4.143617e-03
## [63,] 4.631287e-60 9.999925e-01 7.541274e-06
## [64,] 1.052252e-103 9.850868e-01 1.491324e-02
## [65,] 4.789799e-55 9.999700e-01 2.999393e-05
## [66,] 1.514706e-92 9.787587e-01 2.124125e-02
## [67,] 1.338348e-97 9.899311e-01 1.006893e-02
## [68,] 2.026115e-62 9.999799e-01 2.007314e-05
## [69,] 6.547473e-101 9.941996e-01 5.800427e-03
## [70,] 3.016276e-58 9.999913e-01 8.739959e-06
## [71,] 1.053341e-127 1.609361e-01 8.390639e-01
## [72,] 1.248202e-70 9.997743e-01 2.256698e-04
## [73,] 3.294753e-119 9.245812e-01 7.541876e-02
## [74,] 1.314175e-95 9.979398e-01 2.060233e-03
## [75,] 3.003117e-83 9.982736e-01 1.726437e-03
## [76,] 2.536747e-92 9.865372e-01 1.346281e-02
## [77,] 1.558909e-111 9.102260e-01 8.977398e-02
## [78,] 7.014282e-136 7.989607e-02 9.201039e-01
## [79,] 5.034528e-99 9.854957e-01 1.450433e-02
## [80,] 1.439052e-41 9.999984e-01 1.601574e-06
## [81,] 1.251567e-54 9.999955e-01 4.500139e-06
## [82,] 8.769539e-48 9.999983e-01 1.742560e-06
## [83,] 3.447181e-62 9.999664e-01 3.361987e-05
## [84,] 1.087302e-132 6.134355e-01 3.865645e-01
## [85,] 4.119852e-97 9.918297e-01 8.170260e-03
## [86,] 1.140835e-102 8.734107e-01 1.265893e-01
## [87,] 2.247339e-110 7.971795e-01 2.028205e-01
## [88,] 4.870630e-88 9.992978e-01 7.022084e-04
## [89,] 2.028672e-72 9.997620e-01 2.379898e-04
## [90,] 2.227900e-69 9.999461e-01 5.390514e-05
## [91,] 5.110709e-81 9.998510e-01 1.489819e-04
## [92,] 5.774841e-99 9.885399e-01 1.146006e-02
## [93,] 5.146736e-66 9.999591e-01 4.089540e-05
## [94,] 1.332816e-34 9.999997e-01 2.716264e-07
## [95,] 6.094144e-77 9.998034e-01 1.966331e-04
## [96,] 1.424276e-72 9.998236e-01 1.764463e-04
## [97,] 8.302641e-77 9.996692e-01 3.307548e-04
## [98,] 1.835520e-82 9.988601e-01 1.139915e-03
## [99,] 5.710350e-30 9.999997e-01 3.094739e-07
## [100,] 3.996459e-73 9.998204e-01 1.795726e-04
## [101,] 3.993755e-249 1.031032e-10 1.000000e+00
## [102,] 1.228659e-149 2.724406e-02 9.727559e-01
## [103,] 2.460661e-216 2.327488e-07 9.999998e-01
## [104,] 2.864831e-173 2.290954e-03 9.977090e-01
## [105,] 8.299884e-214 3.175384e-07 9.999997e-01
## [106,] 1.371182e-267 3.807455e-10 1.000000e+00
## [107,] 3.444090e-107 9.719885e-01 2.801154e-02
## [108,] 3.741929e-224 1.782047e-06 9.999982e-01
## [109,] 5.564644e-188 5.823191e-04 9.994177e-01
## [110,] 2.052443e-260 2.461662e-12 1.000000e+00
## [111,] 8.669405e-159 4.895235e-04 9.995105e-01
## [112,] 4.220200e-163 3.168643e-03 9.968314e-01
## [113,] 4.360059e-190 6.230821e-06 9.999938e-01
## [114,] 6.142256e-151 1.423414e-02 9.857659e-01
## [115,] 2.201426e-186 1.393247e-06 9.999986e-01
## [116,] 2.949945e-191 6.128385e-07 9.999994e-01
## [117,] 2.909076e-168 2.152843e-03 9.978472e-01
## [118,] 1.347608e-281 2.872996e-12 1.000000e+00
## [119,] 2.786402e-306 1.151469e-12 1.000000e+00
## [120,] 2.082510e-123 9.561626e-01 4.383739e-02
## [121,] 2.194169e-217 1.712166e-08 1.000000e+00
## [122,] 3.325791e-145 1.518718e-02 9.848128e-01
## [123,] 6.251357e-269 1.170872e-09 1.000000e+00
## [124,] 4.415135e-135 1.360432e-01 8.639568e-01
## [125,] 6.315716e-201 1.300512e-06 9.999987e-01
## [126,] 5.257347e-203 9.507989e-06 9.999905e-01
## [127,] 1.476391e-129 2.067703e-01 7.932297e-01
## [128,] 8.772841e-134 1.130589e-01 8.869411e-01
## [129,] 5.230800e-194 1.395719e-05 9.999860e-01
## [130,] 7.014892e-179 8.232518e-04 9.991767e-01
## [131,] 6.306820e-218 1.214497e-06 9.999988e-01
## [132,] 2.539020e-247 4.668891e-10 1.000000e+00
## [133,] 2.210812e-201 2.000316e-06 9.999980e-01
## [134,] 1.128613e-128 7.118948e-01 2.881052e-01
## [135,] 8.114869e-151 4.900992e-01 5.099008e-01
## [136,] 7.419068e-249 1.448050e-10 1.000000e+00
## [137,] 1.004503e-215 9.743357e-09 1.000000e+00
## [138,] 1.346716e-167 2.186989e-03 9.978130e-01
## [139,] 1.994716e-128 1.999894e-01 8.000106e-01
## [140,] 8.440466e-185 6.769126e-06 9.999932e-01
## [141,] 2.334365e-218 7.456220e-09 1.000000e+00
## [142,] 2.179139e-183 6.352663e-07 9.999994e-01
## [143,] 1.228659e-149 2.724406e-02 9.727559e-01
## [144,] 3.426814e-229 6.597015e-09 1.000000e+00
## [145,] 2.011574e-232 2.620636e-10 1.000000e+00
## [146,] 1.078519e-187 7.915543e-07 9.999992e-01
## [147,] 1.061392e-146 2.770575e-02 9.722942e-01
## [148,] 1.846900e-164 4.398402e-04 9.995602e-01
## [149,] 1.439996e-195 3.384156e-07 9.999997e-01
## [150,] 2.771480e-143 5.987903e-02 9.401210e-01
#CONFUSION MATRIX
out = table(predict(res,iris[,1:4]),iris[,5])
out##
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
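From the confusion matrix, overall accuracy is the sum of the diagonal divided by the total count. A quick check using the matrix shown above:

```r
# Overall accuracy from the confusion matrix printed above
out = matrix(c(50, 0, 0,
               0, 47, 3,
               0, 3, 47), nrow = 3, byrow = TRUE,
             dimnames = list(pred = c("setosa", "versicolor", "virginica"),
                             true = c("setosa", "versicolor", "virginica")))

accuracy = sum(diag(out)) / sum(out)
print(accuracy)   # 144/150 = 0.96
```

Note this is in-sample accuracy; a proper assessment would hold out a test set.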
The goal of the SVM is to map a set of entities with inputs \(X=\{x_1,x_2,\ldots,x_n\}\) of dimension \(n\), i.e., \(X \in R^n\), into a set of categories \(Y=\{y_1,y_2,\ldots,y_m\}\) of dimension \(m\), such that the \(n\)-dimensional \(X\)-space is divided using hyperplanes, which result in the maximal separation between classes \(Y\). A hyperplane is the set of points \({\bf x}\) satisfying the equation
\[ {\bf w} \cdot {\bf x} = b \]
where \(b\) is a scalar constant, and \({\bf w} \in R^n\) is the normal vector to the hyperplane, i.e., the vector at right angles to the plane. The distance between this hyperplane and \({\bf w} \cdot {\bf x} = 0\) is given by \(b/||{\bf w}||\), where \(||{\bf w}||\) is the norm of vector \({\bf w}\).
This setup is sufficient to provide intuition about how the SVM is implemented. Suppose we have two categories of data, i.e., \(y = \{y_1, y_2\}\). If all points in category \(y_1\) lie above a hyperplane \({\bf w} \cdot {\bf x} = b_1\), and all points in category \(y_2\) lie below a hyperplane \({\bf w} \cdot {\bf x} = b_2\), then the distance between the two hyperplanes is \(\frac{|b_1-b_2|}{||{\bf w}||}\).
#Example of hyperplane geometry
w1 = 1; w2 = 2
b1 = 10
#Plot hyperplane in x1, x2 space
x1 = seq(-3,3,0.1)
x2 = (b1-w1*x1)/w2
plot(x1,x2,type="l")
#Create hyperplane 2
b2 = 8
x2 = (b2-w1*x1)/w2
lines(x1,x2,col="red")

#Compute distance to hyperplane 2
print(abs(b1-b2)/sqrt(w1^2+w2^2))
## [1] 0.8944272
We see that this gives the perpendicular distance between the two parallel hyperplanes.
The goal of the SVM is to maximize the distance (separation) between the two hyperplanes, and this is achieved by minimizing norm \(||{\bf w}||\). This naturally leads to a quadratic optimization problem.
\[ \begin{equation} \min_{b_1,b_2,{\bf w}} \frac{1}{2} ||{\bf w}||^2 \end{equation} \]
subject to \({\bf w} \cdot {\bf x} \geq b_1\) for points in category \(y_1\) and \({\bf w} \cdot {\bf x} \leq b_2\) for points in category \(y_2\). Note that this program may find a solution where many of the elements of \({\bf w}\) are zero, i.e., it also finds the minimal set of “support” vectors that separate the two groups. The “half” in front of the minimand is for mathematical convenience in solving the quadratic program.
Of course, there may be no linear hyperplane that perfectly separates the two groups. This slippage may be accounted for in the SVM by allowing for points on the wrong side of the separating hyperplanes using cost functions, i.e., we modify the quadratic program as follows:
\[ \begin{equation} \min_{b_1,b_2,{\bf w},\{\eta_i\}} \frac{1}{2} ||{\bf w}||^2 + C_1 \sum_{i \in \mbox{group 1}} \eta_i + C_2 \sum_{i \in \mbox{group 2}} \eta_i \end{equation} \] where \(C_1,C_2\) are the costs for slippage in groups 1 and 2, respectively. Implementations often assume \(C_1=C_2\). The values \(\eta_i\) are positive for observations that are not perfectly separated, i.e., lead to slippage. Thus, for group 1, \(\eta_i\) is the perpendicular amount by which observation \(i\) lies below the hyperplane \({\bf w} \cdot {\bf x} = b_1\), i.e., it lies on the hyperplane \({\bf w} \cdot {\bf x} = b_1 - \eta_i\). For group 2, \(\eta_i\) is the perpendicular amount by which observation \(i\) lies above the hyperplane \({\bf w} \cdot {\bf x} = b_2\), i.e., it lies on the hyperplane \({\bf w} \cdot {\bf x} = b_2 + \eta_i\). For observations on the correct side of their respective hyperplanes, of course, \(\eta_i=0\).
library(e1071)
#EXAMPLE 1 for SVM
model = svm(iris[,1:4],iris[,5])
model
##
## Call:
## svm.default(x = iris[, 1:4], y = iris[, 5])
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 51
out = predict(model,iris[,1:4])
out
## 1 2 3 4 5 6
## setosa setosa setosa setosa setosa setosa
## 7 8 9 10 11 12
## setosa setosa setosa setosa setosa setosa
## 13 14 15 16 17 18
## setosa setosa setosa setosa setosa setosa
## 19 20 21 22 23 24
## setosa setosa setosa setosa setosa setosa
## 25 26 27 28 29 30
## setosa setosa setosa setosa setosa setosa
## 31 32 33 34 35 36
## setosa setosa setosa setosa setosa setosa
## 37 38 39 40 41 42
## setosa setosa setosa setosa setosa setosa
## 43 44 45 46 47 48
## setosa setosa setosa setosa setosa setosa
## 49 50 51 52 53 54
## setosa setosa versicolor versicolor versicolor versicolor
## 55 56 57 58 59 60
## versicolor versicolor versicolor versicolor versicolor versicolor
## 61 62 63 64 65 66
## versicolor versicolor versicolor versicolor versicolor versicolor
## 67 68 69 70 71 72
## versicolor versicolor versicolor versicolor versicolor versicolor
## 73 74 75 76 77 78
## versicolor versicolor versicolor versicolor versicolor virginica
## 79 80 81 82 83 84
## versicolor versicolor versicolor versicolor versicolor virginica
## 85 86 87 88 89 90
## versicolor versicolor versicolor versicolor versicolor versicolor
## 91 92 93 94 95 96
## versicolor versicolor versicolor versicolor versicolor versicolor
## 97 98 99 100 101 102
## versicolor versicolor versicolor versicolor virginica virginica
## 103 104 105 106 107 108
## virginica virginica virginica virginica virginica virginica
## 109 110 111 112 113 114
## virginica virginica virginica virginica virginica virginica
## 115 116 117 118 119 120
## virginica virginica virginica virginica virginica versicolor
## 121 122 123 124 125 126
## virginica virginica virginica virginica virginica virginica
## 127 128 129 130 131 132
## virginica virginica virginica virginica virginica virginica
## 133 134 135 136 137 138
## virginica versicolor virginica virginica virginica virginica
## 139 140 141 142 143 144
## virginica virginica virginica virginica virginica virginica
## 145 146 147 148 149 150
## virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
print(length(out))
## [1] 150
table(matrix(out),iris[,5])
##
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 2 48
So it does marginally better than naive Bayes. Here is another example.
#EXAMPLE 2 for SVM
train_data = matrix(rpois(60,3),10,6)
print(train_data)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 6 1 6 3 5 4
## [2,] 2 2 2 3 2 4
## [3,] 3 0 4 2 4 5
## [4,] 3 3 4 4 1 3
## [5,] 4 7 6 3 4 0
## [6,] 6 1 1 2 4 2
## [7,] 1 5 4 3 3 6
## [8,] 4 1 3 5 2 3
## [9,] 3 3 4 4 3 4
## [10,] 1 4 4 4 6 6
train_class = as.matrix(c(2,3,1,2,2,1,3,2,3,3))
print(train_class)
## [,1]
## [1,] 2
## [2,] 3
## [3,] 1
## [4,] 2
## [5,] 2
## [6,] 1
## [7,] 3
## [8,] 2
## [9,] 3
## [10,] 3
library(e1071)
model = svm(train_data,train_class)
model
##
## Call:
## svm.default(x = train_data, y = train_class)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1666667
## epsilon: 0.1
##
##
## Number of Support Vectors: 9
pred = predict(model,train_data, type="raw")
table(pred,train_class)
## train_class
## pred 1 2 3
## 1.29176440936163 1 0 0
## 1.68430545397683 1 0 0
## 1.9381669826315 0 1 0
## 2.07882376164611 0 1 0
## 2.07885763288189 0 1 0
## 2.12319491068171 0 1 0
## 2.60099597219594 0 0 1
## 2.66874438847281 0 0 1
## 2.921095532933 0 0 1
## 2.92122308190957 0 0 1
train_fitted = round(pred,0)
print(cbind(train_class,train_fitted))
## train_fitted
## 1 2 2
## 2 3 3
## 3 1 2
## 4 2 2
## 5 2 2
## 6 1 1
## 7 3 3
## 8 2 2
## 9 3 3
## 10 3 3
train_fitted = matrix(train_fitted)
table(train_class,train_fitted)
## train_fitted
## train_class 1 2 3
## 1 1 1 0
## 2 0 4 0
## 3 0 0 4
How do we know if the confusion matrix shows statistically significant classification power? We do a chi-square test.
library(e1071)
res = naiveBayes(iris[,1:4],iris[,5])
pred = predict(res,iris[,1:4])
out = table(pred,iris[,5])
out
##
## pred setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
chisq.test(out)
##
## Pearson's Chi-squared test
##
## data: out
## X-squared = 266.16, df = 4, p-value < 2.2e-16
Given a lexicon of selected words, one may sign the words as positive or negative, and then do a simple word count to compute the net sentiment or mood of the text. By establishing appropriate cutoffs, one can classify text as optimistic, neutral, or pessimistic. These cutoffs are determined using the training and testing data sets.
Word count classifiers may be enhanced by focusing on “emphasis words” such as adjectives and adverbs, especially when classifying emotive content. One approach, used in Das and Chen (2007), is to identify all adjectives and adverbs in the text and then consider only words within \(\pm 3\) words before and after the adjective or adverb. This extracts only the most emphatic parts of the text, which are then mood-scored.
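The basic word-count approach can be sketched in a few lines of base R. The tiny signed lexicon below is invented purely for illustration; a real application would use a curated dictionary.

```r
#Toy signed lexicon (illustrative only; use a curated dictionary in practice)
poslex = c("good","great","up","gain")
neglex = c("bad","poor","down","loss")

#Net sentiment = (#positive words) - (#negative words) in the text
wordCountSentiment = function(txt) {
  words = tolower(unlist(strsplit(txt, "[^A-Za-z]+")))
  sum(words %in% poslex) - sum(words %in% neglex)
}

txt = "Great earnings, stock up on good news despite one bad quarter"
print(wordCountSentiment(txt))
## [1] 2
```

Cutoffs on this score, estimated from training data, would then map the net count into optimistic, neutral, or pessimistic classes.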
Fisher’s discriminant is simply the ratio of the variation of a given word across groups to its variation within groups.
More formally, Fisher’s discriminant score \(F(w)\) for word \(w\) is
\[ \begin{equation} F(w) = \frac{\frac{1}{K} \sum_{j=1}^K ({\bar w}_j - {\bar w}_0)^2}{\frac{1}{K} \sum_{j=1}^K \sigma_j^2} \nonumber \end{equation} \]
where \(K\) is the number of categories, \({\bar w}_j\) is the mean occurrence per text of the word \(w\) in category \(j\), \({\bar w}_0\) is the mean occurrence across all categories, and \(\sigma_j^2\) is the variance of the word’s occurrence within category \(j\). This is just one way in which Fisher’s discriminant may be calculated; there are other variations on the theme.
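The formula above can be transcribed directly into base R. The word counts below (one entry per text, two categories) are invented for illustration:

```r
#Toy data: occurrences of one word in the texts of K=2 categories
counts = list(c(5,6,4,5),    #category 1: word appears often
              c(0,1,0,1))    #category 2: word appears rarely

fisherScore = function(counts) {
  wbar_j = sapply(counts, mean)   #mean occurrence in each category
  wbar_0 = mean(wbar_j)           #mean across all categories
  sig2_j = sapply(counts, var)    #within-category variances
  mean((wbar_j - wbar_0)^2) / mean(sig2_j)
}
print(fisherScore(counts))
## [1] 10.125
```

A high score flags a word whose usage differs sharply between categories relative to its noise within each category, making it a good discriminator.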
Suppose we have 500 documents in each of two categories, bullish and bearish. These 1,000 documents may all be placed as points in \(n\)-dimensional space. It is more than likely that the points in each category will lie closer to each other than to the points in the other category. Now, if we wish to classify a new document, with vector \(D_i\), the obvious idea is to look at which cluster, or which point in either cluster, it is closest to. The closeness between two documents \(i\) and \(j\) is easily determined by the well-known cosine distance metric, i.e.,
\[ \begin{equation} 1 - \cos(\theta_{ij}) = 1 - \frac{D_i^\top D_j}{||D_i|| \cdot ||D_j||} \nonumber \end{equation} \]
where \(||D_i|| = \sqrt{D_i^\top D_i}\) is the norm of the vector \(D_i\). The cosine of the angle between the two document vectors is 1 if the two vectors are identical, and in this case the distance between them would be zero.
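A base-R sketch with toy term-count vectors:

```r
#Cosine distance between two document term-count vectors
cosineDist = function(d1, d2) {
  1 - sum(d1*d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))
}
D1 = c(2,0,1,3)   #toy term-count vectors
D2 = c(1,1,0,2)
print(cosineDist(D1,D2))
## [1] 0.1271284
```

Identical vectors give a (numerically) zero distance, and since term counts are non-negative the distance lies in \([0,1]\).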
The confusion matrix is the classic tool for assessing classification accuracy. Given \(n\) categories, the matrix is of dimension \(n \times n\). The rows relate to the category assigned by the analytic algorithm and the columns refer to the correct category in which the text resides. Each cell \((i,j)\) of the matrix contains the number of text messages that were of type \(j\) and were classified as type \(i\). The cells on the diagonal of the confusion matrix state the number of times the algorithm got the classification right. All other cells are instances of classification error. If an algorithm has no classification ability, then the rows and columns of the matrix will be independent of each other. Under this null hypothesis, the statistic that is examined for rejection is as follows:
\[ \chi^2[dof=(n-1)^2] = \sum_{i=1}^n \sum_{j=1}^n \frac{[A(i,j) - E(i,j)]^2}{E(i,j)} \]
where \(A(i,j)\) are the actual numbers observed in the confusion matrix, and \(E(i,j)\) are the expected numbers, assuming no classification ability under the null. If \(T(i)\) represents the total across row \(i\) of the confusion matrix, and \(T(j)\) the column total, then
\[ E(i,j) = \frac{T(i) \times T(j)}{\sum_{i=1}^n T(i)} \equiv \frac{T(i) \times T(j)}{\sum_{j=1}^n T(j)} \]
The degrees of freedom of the \(\chi^2\) statistic is \((n-1)^2\). This statistic is very easy to implement and may be applied to models for any \(n\). A highly significant statistic is evidence of classification ability.
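Recomputing the statistic by hand for the naive Bayes confusion matrix shown earlier reproduces the value reported by chisq.test:

```r
#Chi-square statistic by hand for the naive Bayes confusion matrix
A = matrix(c(50,0,0, 0,47,3, 0,3,47), 3, 3, byrow=TRUE)
E = outer(rowSums(A), colSums(A)) / sum(A)  #expected counts under the null
stat = sum((A-E)^2/E)
print(stat)
## [1] 266.16
print((nrow(A)-1)^2)   #degrees of freedom
## [1] 4
```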
Algorithm accuracy over a classification scheme is the percentage of text that is correctly classified. This may be done in-sample or out-of-sample. To compute this off the confusion matrix, we calculate
\[ \mbox{Accuracy} = \frac{ \sum_{i=1}^K O(i,i)}{\sum_{i=1}^K M(i)} \]
where \(O(i,i)\) are the diagonal (correctly classified) counts of the confusion matrix, \(K\) is the number of categories, and \(M(i)\) is the number of texts in category \(i\), so that the denominator is the total number of texts classified.
We should hope that this is at least greater than \(1/K\), which is the accuracy level achieved on average from random guessing.
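For the same naive Bayes confusion matrix, accuracy is the trace divided by the grand total:

```r
#Accuracy off the naive Bayes confusion matrix: trace over grand total
O = matrix(c(50,0,0, 0,47,3, 0,3,47), 3, 3, byrow=TRUE)
print(sum(diag(O))/sum(O))
## [1] 0.96
```

Here \(0.96\) is well above the \(1/K = 1/3\) baseline from random guessing.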
The percentage of false positives is a useful metric to work with. It may be calculated as a simple count or as a weighted count (by nearness of wrong category) of false classifications divided by total classifications undertaken.
For example, assume that in the example above, category 1 is BULLISH and category 3 is BEARISH, whereas category 2 is NEUTRAL. The false positives would arise from mis-classifying category 1 as 3 and vice-versa. We compute the false positive rate for illustration.
The false positive rate is just 1% in the example below.
Omatrix = matrix(c(22,1,0,3,44,3,1,1,25),3,3)
print((Omatrix[1,3]+Omatrix[3,1])/sum(Omatrix))
## [1] 0.01
In a 3-way classification scheme, where category 1 is BULLISH and category 3 is BEARISH, whereas category 2 is NEUTRAL, we can compute this metric as follows.
\[ \begin{equation} \mbox{Sentiment Error} = 1 - \frac{M(i=1)-M(i=3)}{M(j=1)-M(j=3)} \nonumber \end{equation} \]
where \(M(i=\cdot)\) are row totals (the algorithm’s classified counts) and \(M(j=\cdot)\) are column totals (the true counts).
In our illustrative example, we may easily calculate this metric. The classified sentiment from the algorithm was \(-2 = 26-28\) (row totals), whereas it actually should have been \(-4 = 23-27\) (column totals). The percentage error in sentiment is therefore 50%.
print(Omatrix)
## [,1] [,2] [,3]
## [1,] 22 3 1
## [2,] 1 44 1
## [3,] 0 3 25
rsum = rowSums(Omatrix)
csum = colSums(Omatrix)
print(rsum)
## [1] 26 46 28
print(csum)
## [1] 23 50 27
print(1 - (-2)/(-4))
## [1] 0.5
The disagreement metric uses the number of signed buys and sells in the day (based on a sentiment model) to determine how much difference of opinion there is in the market. The metric is computed as follows:
\[ \mbox{DISAG} = \left| 1 - \left| \frac{B-S}{B+S} \right| \right| \]
where \(B, S\) are the numbers of classified buys and sells. Note that DISAG is bounded between zero and one.
Using the classified buys (category 1, BULLISH) and sells (category 3, BEARISH) in the same example as before, we may compute disagreement. Since buys and sells are nearly balanced (26 buys and 28 sells, the row totals), disagreement is high.
print(Omatrix)
## [,1] [,2] [,3]
## [1,] 22 3 1
## [2,] 1 44 1
## [3,] 0 3 25
DISAG = abs(1-abs((26-28)/(26+28)))
print(DISAG)
## [1] 0.962963
The creation of the confusion matrix leads naturally to two measures that are associated with it.
Precision is the fraction of identified positives that are truly positive, and is also known as positive predictive value. It is a measure of the usefulness of the prediction. So if the algorithm (say) were tasked with selecting those LinkedIn account holders who are actually looking for a job, and it identifies \(n\) such people, of which only \(m\) really were looking for a job, then the precision would be \(m/n\).
Recall is the proportion of actual positives that are correctly identified, and is also known as sensitivity. It is a measure of how complete the prediction is. If the actual number of people looking for a job on LinkedIn was \(M\), then recall would be \(m/M\).
For example, suppose we have the following confusion matrix.
| Predicted \ Actual | Looking for Job | Not Looking | Total |
|---|---|---|---|
| Looking for Job | 10 | 2 | 12 |
| Not Looking | 1 | 16 | 17 |
| Total | 11 | 18 | 29 |
In this case precision is \(10/12\) and recall is \(10/11\). One minus precision gives the fraction of predicted positives that are false positives (Type I errors), and one minus recall gives the fraction of actual positives that are missed (Type II errors, i.e., false negatives).
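Both metrics can be read off the confusion matrix directly in base R (rows are predicted, columns actual, matching the table above):

```r
#Precision and recall from the 2x2 confusion matrix
cm = matrix(c(10,2, 1,16), 2, 2, byrow=TRUE)
precision = cm[1,1]/sum(cm[1,])  #of those predicted "looking", fraction correct
recall    = cm[1,1]/sum(cm[,1])  #of those actually "looking", fraction found
print(precision)
## [1] 0.8333333
print(recall)
## [1] 0.9090909
```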
The RTextTools package bundles many text classification algorithms into a single package.
library(tm)
library(RTextTools)
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 3.2.5
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
##
## getStemLanguages, wordStem
#Create sample text with positive and negative markers
#(poswords and negwords are the word lists defined earlier in the document)
n = 1000
npos = round(runif(n,1,25))
nneg = round(runif(n,1,25))
flag = matrix(0,n,1)
flag[which(npos>nneg)] = 1
text = NULL
for (j in 1:n) {
res = paste(c(sample(poswords,npos[j]),sample(negwords,nneg[j])),collapse=" ")
text = c(text,res)
}
#Text Classification
m = create_matrix(text)
print(m)
## <<DocumentTermMatrix (documents: 1000, terms: 3707)>>
## Non-/sparse entries: 25755/3681245
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
m = create_matrix(text,weighting=weightTfIdf)
print(m)
## <<DocumentTermMatrix (documents: 1000, terms: 3707)>>
## Non-/sparse entries: 25755/3681245
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
container <- create_container(m,flag,trainSize=1:(n/2), testSize=(n/2+1):n,virgin=FALSE)
#models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"))
models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","TREE"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
#RESULTS
analytics@algorithm_summary # SUMMARY OF PRECISION, RECALL, F-SCORES, AND ACCURACY SORTED BY TOPIC CODE FOR EACH ALGORITHM
## SVM_PRECISION SVM_RECALL SVM_FSCORE GLMNET_PRECISION GLMNET_RECALL
## 0 0.8 0.82 0.81 0.59 0.78
## 1 0.8 0.78 0.79 0.64 0.41
## GLMNET_FSCORE TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0 0.67 0.54 0.93 0.68
## 1 0.50 0.65 0.15 0.24
## MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0 0.81 0.79 0.80
## 1 0.78 0.80 0.79
analytics@label_summary # SUMMARY OF LABEL (e.g. TOPIC) ACCURACY
## NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0 259 392 277
## 1 241 108 223
## PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0 151.35135 106.94981 91.89189
## 1 44.81328 92.53112 36.09959
## PCT_CORRECTLY_CODED_PROBABILITY
## 0 79.53668
## 1 70.53942
analytics@document_summary # RAW SUMMARY OF ALL DATA AND SCORING
## MAXENTROPY_LABEL MAXENTROPY_PROB SVM_LABEL SVM_PROB GLMNET_LABEL
## 1 1 0.8579293 1 0.8402757 0
## 2 0 0.9889092 0 0.7917711 0
## 3 1 0.8771239 1 0.9510173 1
## 4 1 0.7512495 1 0.5171318 0
## 5 1 0.6274731 1 0.7079830 0
## 6 1 0.6523223 0 0.5476286 0
## 7 1 0.7819357 1 0.5937867 0
## 8 0 0.9114306 0 0.8524521 0
## 9 0 0.9481654 0 0.7368287 0
## 10 0 0.9240892 0 0.7124702 0
## 11 0 0.5038318 0 0.6281052 0
## 12 1 0.7534973 1 0.5859398 1
## 13 1 0.5959040 0 0.8325437 0
## 14 0 0.8346023 0 0.7314251 0
## 15 0 0.9162477 0 0.5597547 0
## 16 0 0.9877611 0 0.9007317 0
## 17 0 0.8269503 0 0.5219729 0
## 18 0 0.8783712 0 0.5331234 0
## 19 0 0.5754839 1 0.6368268 0
## 20 1 0.6036500 1 0.6717878 0
## 21 1 0.9125837 1 0.6700221 0
## 22 1 0.5517872 0 0.5522943 0
## 23 1 0.9779073 1 0.8457107 1
## 24 0 0.5791141 0 0.5764542 0
## 25 1 0.9896532 1 0.9452910 1
## 26 0 0.8636291 0 0.6052753 0
## 27 0 0.9996259 0 0.9644365 0
## 28 0 0.7013461 0 0.7835032 1
## 29 0 0.7532735 0 0.6470991 0
## 30 1 0.6994149 1 0.5361563 0
## 31 0 0.9840143 0 0.6337562 1
## 32 1 0.9693421 1 0.8075273 1
## 33 1 0.9773058 1 0.9323526 1
## 34 1 0.9183425 1 0.6772988 0
## 35 1 0.9468948 1 0.7421286 1
## 36 0 0.9477714 0 0.8461861 0
## 37 1 0.6455413 0 0.5373315 0
## 38 1 0.9946571 1 0.6988711 0
## 39 1 0.9992862 1 0.9440295 0
## 40 0 0.9999305 0 0.9579143 0
## 41 1 0.9961413 1 0.8555121 0
## 42 0 0.9990200 0 0.9716997 0
## 43 1 0.7513958 0 0.5090049 0
## 44 0 0.9763378 0 0.7047927 1
## 45 0 0.8208638 0 0.7682049 0
## 46 1 0.5955903 0 0.5808784 0
## 47 0 0.9637975 0 0.9045674 0
## 48 1 0.9816336 1 0.8125144 1
## 49 1 0.8275467 1 0.6484182 1
## 50 0 0.9918852 0 0.8352544 0
## 51 0 0.8478272 0 0.6124079 0
## 52 1 0.6703351 1 0.5165522 0
## 53 0 0.5111346 0 0.6634197 0
## 54 0 0.5812117 0 0.6333654 0
## 55 1 0.9928267 1 0.8860294 0
## 56 1 0.7327049 1 0.5830705 0
## 57 0 0.9730584 0 0.8657668 0
## 58 0 0.9968959 0 0.9085251 1
## 59 0 0.5414529 0 0.5438450 1
## 60 1 0.8750862 1 0.5107863 0
## 61 1 0.7745912 1 0.6563022 1
## 62 1 0.8836069 1 0.7450693 1
## 63 0 0.8662524 1 0.5233994 0
## 64 1 0.9963765 1 0.8794826 1
## 65 1 0.9211440 1 0.8087564 1
## 66 0 0.9624085 0 0.7444767 1
## 67 1 0.5772304 0 0.6569568 0
## 68 0 0.9910539 0 0.9090005 0
## 69 0 0.9783718 0 0.8322591 0
## 70 0 0.7374812 0 0.6836114 0
## 71 1 0.9999973 1 0.9980898 1
## 72 1 0.8324743 1 0.5215142 0
## 73 1 0.6593330 1 0.6342903 0
## 74 1 0.9724665 1 0.5740967 1
## 75 0 0.7777847 0 0.6719141 0
## 76 0 0.9615039 0 0.8009526 0
## 77 0 0.8134608 1 0.5497218 1
## 78 1 0.8454406 1 0.5883797 0
## 79 1 0.9479331 0 0.5360295 1
## 80 1 0.9230795 1 0.8210145 0
## 81 0 0.9203069 0 0.6967228 0
## 82 1 0.8738333 1 0.7200714 1
## 83 0 0.7767262 0 0.6966733 1
## 84 1 0.6662228 1 0.6142579 0
## 85 0 0.9558573 0 0.8713604 0
## 86 0 0.5209858 0 0.6368313 0
## 87 0 0.9834270 0 0.6966126 0
## 88 0 0.5039243 1 0.5810159 1
## 89 0 0.6592179 0 0.5473700 0
## 90 1 0.8206025 0 0.5857905 1
## 91 0 0.9485201 0 0.7013216 0
## 92 0 0.9788151 0 0.7388368 0
## 93 1 0.9157100 1 0.7152854 0
## 94 1 0.9132725 1 0.7660061 1
## 95 1 0.9358443 1 0.7086191 0
## 96 1 0.5151482 0 0.5249791 0
## 97 1 0.5463302 0 0.6979842 0
## 98 1 0.6657931 0 0.5744932 0
## 99 1 0.9573048 1 0.5617906 0
## 100 1 0.5914636 1 0.5875220 1
## 101 0 0.9937141 0 0.9468874 0
## 102 1 0.9889676 1 0.8175819 1
## 103 1 0.6751737 1 0.6128321 1
## 104 0 0.5129555 1 0.5417314 1
## 105 1 0.7528127 1 0.7529493 0
## 106 1 0.9946818 1 0.9455856 1
## 107 0 0.6762017 0 0.6156119 1
## 108 0 0.7763997 0 0.5316527 0
## 109 0 0.9818108 0 0.7262813 0
## 110 1 0.8674041 1 0.7825652 0
## 111 0 0.7478253 0 0.6084569 0
## 112 1 0.9836338 1 0.8543653 1
## 113 0 0.9634648 0 0.9359335 0
## 114 0 0.8182412 0 0.8653513 0
## 115 1 0.7047897 0 0.5145153 0
## 116 1 0.9200210 1 0.7607339 0
## 117 0 0.9933144 0 0.9131847 0
## 118 0 0.9025121 0 0.6184797 0
## 119 1 0.9937498 1 0.9080192 1
## 120 1 0.5402749 0 0.5658089 0
## 121 0 0.9753441 0 0.9230138 0
## 122 1 0.9706926 1 0.5247743 0
## 123 0 0.6742449 0 0.7398689 0
## 124 0 0.8742186 0 0.7326573 0
## 125 0 0.5335593 1 0.5415353 0
## 126 0 0.8910276 0 0.8034074 0
## 127 0 0.9934759 0 0.9413915 0
## 128 0 0.9770633 0 0.8111901 0
## 129 0 0.9974013 0 0.9472345 0
## 130 0 0.9927454 0 0.9288016 0
## 131 1 0.9924156 1 0.9112208 0
## 132 0 0.9762022 0 0.8702204 0
## 133 1 0.9753263 1 0.6592442 0
## 134 1 0.9999129 1 0.9328961 1
## 135 1 0.5528284 0 0.6754797 1
## 136 1 0.7963306 1 0.7210398 0
## 137 0 0.7610011 1 0.5000000 0
## 138 1 0.9552446 1 0.8322440 1
## 139 1 0.8913725 1 0.6311903 0
## 140 0 0.9564910 0 0.7348463 0
## 141 0 0.9975408 0 0.8700072 0
## 142 1 0.9429426 1 0.7539096 0
## 143 0 0.9763186 0 0.8522256 0
## 144 0 0.9504727 0 0.8119919 0
## 145 0 0.6506800 0 0.7310967 0
## 146 0 0.6387208 0 0.5352087 0
## 147 0 0.9072185 0 0.7639297 0
## 148 0 0.7050164 0 0.6584521 0
## 149 0 0.9671963 0 0.8330123 1
## 150 1 0.8606746 0 0.6428315 0
## 151 1 0.9778635 1 0.7972212 0
## 152 0 0.8921047 1 0.6049122 1
## 153 0 0.7423709 0 0.5317705 1
## 154 0 0.9402686 0 0.8672841 0
## 155 0 0.6657072 0 0.7161006 0
## 156 0 0.9992692 0 0.9448097 0
## 157 1 0.7233352 1 0.5800949 1
## 158 1 0.9985942 1 0.9708887 0
## 159 0 0.6700544 0 0.7031450 0
## 160 1 0.9872401 1 0.6853467 0
## 161 0 0.7979641 0 0.6832418 0
## 162 0 0.7472300 0 0.6975315 1
## 163 0 0.5928332 0 0.7230411 0
## 164 0 0.9837886 0 0.8250189 0
## 165 0 0.6306148 1 0.6488264 1
## 166 1 0.9104182 1 0.6929846 0
## 167 0 0.9316773 0 0.8365241 0
## 168 0 0.8405723 0 0.6611237 0
## 169 0 0.9565921 0 0.7508357 0
## 170 0 0.7110081 0 0.6531900 1
## 171 1 0.7560378 1 0.7902000 1
## 172 0 0.9035350 0 0.5478064 0
## 173 1 0.9217055 1 0.7377317 0
## 174 0 0.9724004 0 0.7529122 0
## 175 1 0.7339565 1 0.6350940 0
## 176 1 0.9478130 1 0.7114427 1
## 177 0 0.9935900 0 0.8726856 0
## 178 1 0.6289705 1 0.5480000 0
## 179 0 0.5311258 0 0.5879138 0
## 180 1 0.9925060 1 0.8921324 1
## 181 1 0.9137843 1 0.7270650 0
## 182 0 0.9519986 0 0.8810549 0
## 183 0 0.7883762 0 0.5468736 1
## 184 1 0.9081567 1 0.7568435 1
## 185 0 0.8683790 0 0.5569881 0
## 186 0 0.7008180 0 0.5625275 0
## 187 1 0.9973144 1 0.9299310 0
## 188 0 0.9519957 0 0.9115386 0
## 189 0 0.5611704 1 0.5000000 0
## 190 1 0.7220722 1 0.6971641 1
## 191 0 0.6198362 0 0.5935343 0
## 192 1 0.9846100 1 0.8471466 1
## 193 0 0.9327107 0 0.7992874 0
## 194 0 0.7101888 0 0.5334756 0
## 195 1 0.7648032 1 0.6125674 1
## 196 1 0.9854055 1 0.7351516 1
## 197 1 0.9762002 1 0.8398882 0
## 198 0 0.9979624 0 0.9370254 0
## 199 0 0.6803242 0 0.6342106 0
## 200 0 0.6705083 1 0.5242053 1
## 201 0 0.7522277 0 0.5556011 0
## 202 1 0.6863030 1 0.5176115 0
## 203 1 0.6580211 1 0.5716186 0
## 204 0 0.8634015 0 0.8077151 0
## 205 1 0.8436331 1 0.6238419 0
## 206 1 0.9683306 1 0.8070274 0
## 207 1 0.9522856 1 0.6924892 1
## 208 1 0.7720141 1 0.6598490 0
## 209 1 0.9958683 1 0.8699881 0
## 210 0 0.9862749 0 0.8041823 0
## 211 0 0.5969224 1 0.7062762 1
## 212 0 0.9336405 0 0.7558863 0
## 213 1 0.9977380 1 0.9286261 0
## 214 0 0.9860395 0 0.9325511 0
## 215 0 0.9977886 0 0.9420480 0
## 216 0 0.9401983 0 0.8076000 0
## 217 1 0.9013253 1 0.6563627 0
## 218 0 0.5187154 0 0.5842494 0
## 219 0 0.6690334 1 0.6339251 0
## 220 0 0.9456176 0 0.8697507 0
## 221 1 0.8586761 1 0.5529408 1
## 222 0 0.8862194 0 0.7476264 1
## 223 1 0.9225414 1 0.5213904 0
## 224 1 0.7841996 1 0.5323519 0
## 225 0 0.9991583 0 0.9481004 1
## 226 1 0.9304379 1 0.6965753 0
## 227 0 0.5710796 0 0.5286048 0
## 228 1 0.8308168 1 0.7026651 1
## 229 1 0.8601532 1 0.5515215 1
## 230 1 0.9674282 1 0.7099447 0
## 231 0 0.7017201 0 0.7500263 1
## 232 0 0.5223424 0 0.5729081 0
## 233 0 0.9121292 0 0.5948308 0
## 234 0 0.9299358 0 0.6050689 1
## 235 0 0.9795657 0 0.8375783 0
## 236 1 0.9461373 1 0.7772152 0
## 237 1 0.8068785 1 0.5183433 0
## 238 0 0.9939132 0 0.9481846 0
## 239 1 0.9950434 1 0.9235710 0
## 240 1 0.9115514 0 0.5080233 0
## 241 1 0.5365909 1 0.6667368 1
## 242 0 0.9988258 0 0.9395797 0
## 243 0 0.9992458 0 0.9331821 1
## 244 1 0.8956701 1 0.5742082 0
## 245 0 0.8813561 0 0.6236588 0
## 246 1 0.7937640 1 0.7213871 1
## 247 0 0.9238146 0 0.6728945 0
## 248 1 0.9996548 1 0.9581935 1
## 249 0 0.9871408 0 0.8609307 0
## 250 1 0.9993370 1 0.9383927 1
## 251 0 0.9356324 0 0.7329846 0
## 252 1 0.9840480 1 0.8597488 0
## 253 0 0.9869608 0 0.9131760 0
## 254 1 0.9982424 1 0.8869734 0
## 255 0 0.9138004 0 0.6861681 0
## 256 0 0.6858577 1 0.5622951 1
## 257 0 0.5985646 0 0.6565838 0
## 258 1 0.9710643 1 0.7977791 0
## 259 1 0.9589746 1 0.7922633 1
## 260 1 0.9549443 1 0.7806370 0
## 261 0 0.6918860 0 0.5766865 0
## 262 0 0.9985237 0 0.8982357 0
## 263 1 0.8399776 1 0.6923087 1
## 264 0 0.9915302 0 0.7551562 0
## 265 1 0.9956326 1 0.8826648 0
## 266 1 0.9547358 1 0.8752002 0
## 267 1 0.8960485 1 0.7327110 1
## 268 1 0.6222101 0 0.5769813 1
## 269 1 0.9995143 1 0.9429565 1
## 270 1 0.9027375 1 0.5546629 0
## 271 0 0.9819313 0 0.8049833 0
## 272 1 0.9201509 1 0.8614444 1
## 273 0 0.9984833 0 0.6930337 0
## 274 1 0.6100678 0 0.5414032 1
## 275 1 0.7803396 1 0.5175333 0
## 276 0 0.8704907 0 0.6726541 0
## 277 0 0.8580918 0 0.8068547 0
## 278 1 0.9967249 1 0.9028317 1
## 279 0 0.6276587 0 0.6460706 1
## 280 0 0.8532253 0 0.7217786 0
## 281 0 0.7642552 0 0.7148068 0
## 282 1 0.8410749 1 0.7090440 1
## 283 1 0.5896253 1 0.6433853 1
## 284 0 0.8554587 0 0.8267814 0
## 285 0 0.9893251 0 0.8502544 0
## 286 1 0.9871357 1 0.8239922 0
## 287 1 0.9902735 1 0.9128083 1
## 288 0 0.9985652 0 0.9326216 1
## 289 1 0.9915358 1 0.8658739 0
## 290 1 0.9997295 1 0.9252811 0
## 291 1 0.6164611 1 0.6474987 1
## 292 1 0.9053614 1 0.7346250 1
## 293 1 0.8422158 1 0.5884936 1
## 294 0 0.9114742 0 0.6671633 0
## 295 0 0.9984413 0 0.9119299 0
## 296 0 0.8707787 0 0.7631776 0
## 297 0 0.5836651 1 0.6607701 0
## 298 1 0.9860063 1 0.8541644 1
## 299 1 0.6162486 1 0.6223337 0
## 300 0 0.8305191 0 0.6640235 0
## 301 0 0.9427569 0 0.5294931 0
## 302 1 0.9609790 1 0.7429972 0
## 303 1 0.9739832 1 0.6503368 0
## 304 1 0.9471382 1 0.7350835 0
## 305 1 0.6543756 0 0.5728064 1
## 306 0 0.9122146 0 0.8051103 0
## 307 1 0.8238401 1 0.5624106 0
## 308 1 0.7811818 1 0.5548059 0
## 309 0 0.8406514 0 0.7738493 0
## 310 0 0.9457159 0 0.6870276 1
## 311 0 0.7998331 0 0.5510085 0
## 312 1 0.9992647 1 0.9398967 1
## 313 0 0.9079526 0 0.6203149 0
## 314 0 0.8368040 0 0.6272342 1
## 315 1 0.9925879 1 0.9578279 1
## 316 0 0.6228280 0 0.6482964 0
## 317 0 0.6739993 0 0.6757330 1
## 318 1 0.5705777 1 0.6939397 1
## 319 1 0.8237647 1 0.5931125 0
## 320 0 0.8342470 0 0.6562285 0
## 321 0 0.9322149 0 0.7823699 0
## 322 0 0.8560471 0 0.7969113 0
## 323 1 0.7914854 0 0.5084336 0
## 324 1 0.5482127 1 0.6354649 0
## 325 1 0.9986399 1 0.8703626 0
## 326 1 0.7073741 1 0.5423448 0
## 327 0 0.6749616 1 0.6179744 1
## 328 1 0.9690866 1 0.7431435 0
## 329 1 0.8789130 1 0.7282262 1
## 330 0 0.9014785 0 0.5646138 0
## 331 1 0.8707241 1 0.7249503 1
## 332 1 0.8597746 1 0.7923170 0
## 333 1 0.8215223 1 0.5616467 1
## 334 1 0.9961448 1 0.9181635 1
## 335 0 0.9988321 0 0.9423715 0
## 336 1 0.5135696 1 0.5000000 0
## 337 0 0.9899112 0 0.9073483 1
## 338 0 0.8359537 0 0.6753093 0
## 339 1 0.9705366 1 0.8288985 1
## 340 1 0.9943038 1 0.6951728 0
## 341 1 0.9163763 1 0.6460686 1
## 342 1 0.6998493 1 0.7422305 0
## 343 0 0.6632546 0 0.7485119 0
## 344 0 0.9609904 0 0.5989242 1
## 345 1 0.7437858 1 0.6102799 0
## 346 1 0.8466107 1 0.6968808 1
## 347 0 0.5543564 0 0.6482536 0
## 348 1 0.9804986 1 0.6492958 0
## 349 0 0.8736247 0 0.6875652 0
## 350 1 0.6728316 0 0.5712710 0
## 351 0 0.9860867 0 0.6926137 0
## 352 1 0.9974535 1 0.9727068 1
## 353 1 0.7442008 0 0.5370067 0
## 354 0 0.8777880 0 0.6573623 0
## 355 1 0.9973117 1 0.8947501 0
## 356 1 0.7015934 1 0.7585121 0
## 357 0 0.6393297 0 0.7159780 0
## 358 1 0.7595052 1 0.5139238 1
## 359 0 0.7916158 0 0.6007815 0
## 360 0 0.6808300 0 0.5993908 0
## 361 1 0.7723349 1 0.6779353 1
## 362 1 0.9894503 1 0.7380638 1
## 363 1 0.9896806 1 0.8968396 1
## 364 1 0.7789526 1 0.6853853 1
## 365 0 0.9965698 0 0.9042862 0
## 366 1 0.8788495 1 0.7651889 0
## 367 1 0.9660800 1 0.9385775 1
## 368 1 0.8342565 0 0.5669056 1
## 369 0 0.8947800 0 0.7513126 1
## 370 0 0.5025263 0 0.6584950 1
## 371 0 0.8104135 0 0.7068002 0
## 372 0 0.8218882 0 0.7668281 0
## 373 0 0.8226753 0 0.7980549 0
## 374 1 0.5052631 0 0.5439073 0
## 375 1 0.9982171 1 0.9412506 0
## 376 0 0.9540871 0 0.6988135 0
## 377 0 0.9926865 0 0.9181457 0
## 378 1 0.6010322 0 0.5373245 0
## 379 0 0.7856743 0 0.5208672 1
## 380 1 0.8718875 1 0.8180463 0
## 381 1 0.6489393 0 0.5290149 0
## 382 1 0.9894129 1 0.8094954 1
## 383 1 0.9796525 1 0.8596787 1
## 384 0 0.9979607 0 0.9386429 0
## ...
## GLMNET_PROB TREE_LABEL TREE_PROB MANUAL_CODE CONSENSUS_CODE
## 1 0.5288133 0 0.6616541 0 0
## 2 0.8873376 0 0.6616541 0 0
## 3 0.9923295 1 1.0000000 1 1
## ...
## CONSENSUS_AGREE CONSENSUS_INCORRECT PROBABILITY_CODE
## 1 2 0 1
## 2 4 0 0
## 3 4 0 1
## ...
## PROBABILITY_INCORRECT
## 1 1
## 2 0
## 3 0
## ...
## (output truncated: analytics@document_summary has 500 rows, one per test
## document, printed in several blocks of columns)
analytics@ensemble_summary  # summary of ensemble precision/coverage; uses the n passed into create_analytics()
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1.00 0.65
## n >= 2 1.00 0.65
## n >= 3 0.73 0.79
## n >= 4 0.46 0.86
#CONFUSION MATRIX
yhat = as.matrix(analytics@document_summary$CONSENSUS_CODE)
y = flag[(n/2+1):n]
print(table(y, yhat))
##    yhat
## y 0 1
## 0 238 21
## 1 154 87
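From the confusion matrix printed above we can derive the usual classification metrics. The counts below are hard-coded from that table (rows are the manual labels `y`, columns the consensus labels `yhat`).

```r
# Accuracy, precision, and recall from the confusion matrix above.
tn <- 238; fp <- 21    # y = 0 row
fn <- 154; tp <- 87    # y = 1 row

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

Accuracy is 0.65, matching the n >= 1 ensemble recall reported above; precision for the positive class is about 0.81, but recall is only about 0.36, so the consensus labels miss many true positives.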
In recent years, the SAT added an essay section. While the essay aims to assess original writing, its grading has also been automated. One goal of the test is to gauge a student's writing level, which is closely tied to the notion of readability.
"Readability" measures how easy it is to comprehend text. In the interest of efficient markets, regulators want to foster transparency by ensuring that financial documents disseminated to the investing public are readable. Metrics for readability are therefore important and have recently been gaining traction.
Gunning (1952) developed the Fog index, which estimates the years of formal education needed to understand a text on first reading. A Fog index of 12 corresponds to the reading level of a U.S. high school senior (around 18 years old). The index is based on the idea that poor readability is associated with longer sentences and complex words, where complex words are those with three or more syllables. The formula for the Fog index is
\[ 0.4 \cdot \left[\frac{\mbox{\#words}}{\mbox{\#sentences}} + 100 \cdot \left( \frac{\mbox{\#complex words}}{\mbox{\#words}} \right) \right] \]
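The formula above can be coded directly. Here is a minimal Fog index calculator; the sentence splitter is naive and the syllable counter is a crude vowel-group heuristic, so treat the output as an estimate.

```r
# Crude syllable count: number of contiguous vowel groups, minimum 1.
count_syllables <- function(word) {
  word <- gsub("[^a-z]", "", tolower(word))
  if (nchar(word) == 0) return(0)
  max(1, length(gregexpr("[aeiouy]+", word)[[1]]))
}

# Fog index: 0.4 * (words per sentence + 100 * fraction of complex words).
fog_index <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[grepl("[a-zA-Z]", sentences)]
  words <- unlist(strsplit(text, "[^a-zA-Z]+"))
  words <- words[nchar(words) > 0]
  syl <- sapply(words, count_syllables)
  complex_words <- sum(syl >= 3)   # "complex" = three or more syllables
  0.4 * (length(words) / length(sentences) + 100 * complex_words / length(words))
}

fog_index("The cat sat on the mat. It was a sunny day.")  # short sentences, no complex words -> low score
```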
Alternative readability scores use similar ideas. The Flesch Reading Ease Score and the Flesch-Kincaid Grade Level also use counts of words, syllables, and sentences. See http://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests. The Flesch Reading Ease Score is defined as
\[ 206.835 - 1.015 \left(\frac{\mbox{\#words}}{\mbox{\#sentences}}\right) - 84.6 \left( \frac{\mbox{\#syllables}}{\mbox{\#words}} \right) \]
Scores of 90-100 are easily understood by an average 11-year-old, 60-70 by 13- to 15-year-olds, and 0-30 by university graduates.
The Flesch-Kincaid Grade Level is defined as
\[ 0.39 \left(\frac{\mbox{\#words}}{\mbox{\#sentences}}\right) + 11.8 \left( \frac{\mbox{\#syllables}}{\mbox{\#words}} \right) -15.59 \]
which gives a number that corresponds to a U.S. grade level. As expected, these two Flesch measures are negatively correlated. Various other measures of readability use the same ideas as the Fog index. For example, the Coleman and Liau (1975) index does not even require a syllable count:
\[ CLI = 0.0588 L - 0.296 S - 15.8 \]
where \(L\) is the average number of letters per hundred words and \(S\) is the average number of sentences per hundred words.
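To see how these formulas relate, we can plug a single set of counts into each. The counts below are hypothetical (100 words, 5 sentences, 140 syllables, 450 letters), chosen only for illustration.

```r
# Hypothetical counts for one passage.
words <- 100; sentences <- 5; syllables <- 140; letters <- 450

# Flesch Reading Ease (higher = easier).
flesch <- 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Flesch-Kincaid Grade Level (a U.S. grade level).
fk_grade <- 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Coleman-Liau index: L = letters per 100 words, S = sentences per 100 words.
L <- letters / words * 100
S <- sentences / words * 100
cli <- 0.0588 * L - 0.296 * S - 15.8

round(c(Flesch = flesch, FK_grade = fk_grade, CLI = cli), 2)
```

For these counts the Reading Ease score is about 68 ("plain English"), while the grade-level estimates from Flesch-Kincaid (about 8.7) and Coleman-Liau (about 9.2) roughly agree, illustrating the negative correlation between Reading Ease and grade level.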
Standard readability metrics may not work well for financial text. Loughran and McDonald (2014) find that the Fog index is inferior to simply looking at 10-K file size.
References
M. Coleman and T. L. Liau. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology 60, 283-284.
T. Loughran and W. McDonald, (2014). Measuring readability in financial disclosures, The Journal of Finance 69, 1643-1671.
The R package koRpus supports readability scoring; see http://www.inside-r.org/packages/cran/koRpus/docs/readability
First, let’s grab some text from my web site.
library(rvest)
url = "http://srdas.github.io/bio-candid.html"
doc.html = read_html(url)
text = doc.html %>% html_nodes("p") %>% html_text()
text = gsub("[\t\n]"," ",text)
text = gsub('"'," ",text) # removes double-quote characters
text = paste(text, collapse=" ")
print(text)## [1] " Sanjiv Das: A Short Academic Life History After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research. He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California. Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course. Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse. Sanjiv's research style is instilled with a distinct New York state of mind - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley. Coastal living did a lot to mold Sanjiv, who needs to live near the ocean. 
The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education - you can check out any time you like, but you can never leave. Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good. "
Now we can assess it for readability.
library(koRpus)
write(text,file="textvec.txt")
text_tokens = tokenize("textvec.txt",lang="en")
#print(text_tokens)
print(c("Number of sentences: ",text_tokens@desc$sentences))## [1] "Number of sentences: " "24"
print(c("Number of words: ",text_tokens@desc$words))## [1] "Number of words: " "446"
print(c("Number of words per sentence: ",text_tokens@desc$avg.sentc.length))## [1] "Number of words per sentence: " "18.5833333333333"
print(c("Average length of words: ",text_tokens@desc$avg.word.length))## [1] "Average length of words: " "4.67488789237668"
Next we generate several indices of readability, which are worth looking at.
print(readability(text_tokens))## Hyphenation (language: en)
## Warning: Bormuth: Missing word list, hence not calculated.
## Warning: Coleman: POS tags are not elaborate enough, can't count pronouns
## and prepositions. Formulae skipped.
## Warning: Dale-Chall: Missing word list, hence not calculated.
## Warning: DRP: Missing Bormuth Mean Cloze, hence not calculated.
## Warning: Harris.Jacobson: Missing word list, hence not calculated.
## Warning: Spache: Missing word list, hence not calculated.
## Warning: Traenkle.Bailer: POS tags are not elaborate enough, can't count
## prepositions and conjunctions. Formulae skipped.
## Warning: Note: The implementations of these formulas are still subject to validation:
## Coleman, Danielson.Bryan, Dickes.Steiwer, ELF, Fucks, Harris.Jacobson, nWS, Strain, Traenkle.Bailer, TRI
## Use the results with caution, even if they seem plausible!
##
## Automated Readability Index (ARI)
## Parameters: default
## Grade: 9.88
##
##
## Coleman-Liau
## Parameters: default
## ECP: 47% (estimated cloze percentage)
## Grade: 10.09
## Grade: 10.1 (short formula)
##
##
## Danielson-Bryan
## Parameters: default
## DB1: 7.64
## DB2: 48.58
## Grade: 9-12
##
##
## Dickes-Steiwer's Handformel
## Parameters: default
## TTR: 0.58
## Score: 42.76
##
##
## Easy Listening Formula
## Parameters: default
## Exsyls: 149
## Score: 6.21
##
##
## Farr-Jenkins-Paterson
## Parameters: default
## RE: 56.1
## Grade: >= 10 (high school)
##
##
## Flesch Reading Ease
## Parameters: en (Flesch)
## RE: 59.75
## Grade: >= 10 (high school)
##
##
## Flesch-Kincaid Grade Level
## Parameters: default
## Grade: 9.54
## Age: 14.54
##
##
## Gunning Frequency of Gobbledygook (FOG)
## Parameters: default
## Grade: 12.55
##
##
## FORCAST
## Parameters: default
## Grade: 10.01
## Age: 15.01
##
##
## Fucks' Stilcharakteristik
## Score: 86.88
## Grade: 9.32
##
##
## Linsear Write
## Parameters: default
## Easy words: 87
## Hard words: 13
## Grade: 11.71
##
##
## Läsbarhetsindex (LIX)
## Parameters: default
## Index: 40.56
## Rating: standard
## Grade: 6
##
##
## Neue Wiener Sachtextformeln
## Parameters: default
## nWS 1: 5.42
## nWS 2: 5.97
## nWS 3: 6.28
## nWS 4: 6.81
##
##
## Readability Index (RIX)
## Parameters: default
## Index: 4.08
## Grade: 9
##
##
## Simple Measure of Gobbledygook (SMOG)
## Parameters: default
## Grade: 12.01
## Age: 17.01
##
##
## Strain Index
## Parameters: default
## Index: 8.45
##
##
## Kuntzsch's Text-Redundanz-Index
## Parameters: default
## Short words: 297
## Punctuation: 71
## Foreign: 0
## Score: -56.22
##
##
## Tuldava's Text Difficulty Formula
## Parameters: default
## Index: 4.43
##
##
## Wheeler-Smith
## Parameters: default
## Score: 62.08
## Grade: > 4
##
## Text language: en
It is really easy to write a summarizer in a few lines of code. The function below takes in a text array and returns an n-sentence summary. Each element of the array is one sentence of the document we want summarized.
In the function we need to calculate how similar each sentence is to every other sentence. This could be done using cosine similarity, but here we use another approach, Jaccard similarity. For two sentences, Jaccard similarity is the size of the intersection of their word sets divided by the size of the union of those sets.
A document \(D\) is comprised of \(m\) sentences \(s_i, i=1,2,...,m\), where each \(s_i\) is a set of words. We compute the pairwise overlap between sentences using the Jaccard similarity index:
\[ J_{ij} = J(s_i, s_j) = \frac{|s_i \cap s_j|}{|s_i \cup s_j|} = J_{ji} \]
The overlap is the size of the intersection of the word sets of sentences \(s_i\) and \(s_j\), divided by the size of their union. The similarity score of each sentence is computed as the row sum of the Jaccard similarity matrix.
\[ {\cal S}_i = \sum_{j=1}^m J_{ij} \]
Once the row sums are obtained, they are sorted in decreasing order, and the summary comprises the \(n\) sentences with the highest \({\cal S}_i\) values.
# FUNCTION TO RETURN n SENTENCE SUMMARY
# Input: array of sentences (text)
# Output: n most common intersecting sentences
text_summary = function(text, n) {
  m = length(text)           # number of sentences in input
  jaccard = matrix(0, m, m)  # stores pairwise Jaccard similarity
  for (i in 1:m) {
    for (j in i:m) {
      aa = unlist(strsplit(text[i], " "))
      bb = unlist(strsplit(text[j], " "))
      jaccard[i,j] = length(intersect(aa, bb))/length(union(aa, bb))
      jaccard[j,i] = jaccard[i,j]
    }
  }
  similarity_score = rowSums(jaccard)
  res = sort(similarity_score, index.return=TRUE, decreasing=TRUE)
  idx = res$ix[1:n]
  summary = text[idx]
}
We will use a sample of text that I took from Bloomberg news. It is about the need for data scientists.
url = "data_files/dstext_sample.txt" #You can put any text file or URL here
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=1)
print(length(text[[1]]))## [1] 1
print("ORIGINAL TEXT")## [1] "ORIGINAL TEXT"
print(text)## [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver. Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.” Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data. Data scientists were meant to be the answer to this issue. Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data. This has created a huge market for people with these skills. US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer. And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data. It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%. However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists. May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets. 
The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole. This theme of centralized vs. decentralized decision-making is one that has long been debated in the management literature. For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience. Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs. But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets. Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself. Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear. He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’. But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves. One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan. They reviewed the workings of large US organisations over fifteen years from the mid-80s. What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level. 
Their research indicated that decentralisation pays. And technological advancement often goes hand-in-hand with decentralization. Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer. Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources. They can do it themselves, in just minutes. The decentralization trend is now impacting on technology spending. According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling. Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable. Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands. But this approach is not necessarily always adopted. For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e. using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget. The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e. how do people actually deliver value from data assets. Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative. As ever then, the real value from data comes from asking the right questions of the data. 
And the right questions to ask only emerge if you are close enough to the business to see them. Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data. Which probably means that data scientists’ salaries will need to take a hit in the process."
text2 = strsplit(text,". ",fixed=TRUE) #Special handling of the period.
text2 = text2[[1]]
print("SENTENCES")## [1] "SENTENCES"
print(text2)## [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver"
## [2] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.” Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
## [3] "Data scientists were meant to be the answer to this issue"
## [4] "Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data"
## [5] "This has created a huge market for people with these skills"
## [6] "US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer"
## [7] "And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data"
## [8] " It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%"
## [9] " However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists"
## [10] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"
## [11] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"
## [12] "This theme of centralized vs"
## [13] "decentralized decision-making is one that has long been debated in the management literature"
## [14] " For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience"
## [15] "Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs"
## [16] " But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets"
## [17] "Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself"
## [18] "Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear"
## [19] " He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’"
## [20] "But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves"
## [21] "One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan"
## [22] "They reviewed the workings of large US organisations over fifteen years from the mid-80s"
## [23] "What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level"
## [24] "Their research indicated that decentralisation pays"
## [25] "And technological advancement often goes hand-in-hand with decentralization"
## [26] "Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer"
## [27] "Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources"
## [28] "They can do it themselves, in just minutes"
## [29] " The decentralization trend is now impacting on technology spending"
## [30] "According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling"
## [31] "Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable"
## [32] "Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands"
## [33] "But this approach is not necessarily always adopted"
## [34] "For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e"
## [35] "using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget"
## [36] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"
## [37] "how do people actually deliver value from data assets"
## [38] "Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative"
## [39] "As ever then, the real value from data comes from asking the right questions of the data"
## [40] "And the right questions to ask only emerge if you are close enough to the business to see them"
## [41] "Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data"
## [42] "Which probably means that data scientists’ salaries will need to take a hit in the process."
print("SUMMARY")## [1] "SUMMARY"
res = text_summary(text2,5)
print(res)## [1] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.” Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
## [2] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"
## [3] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"
## [4] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"
## [5] "Which probably means that data scientists’ salaries will need to take a hit in the process."
In this segment we explore various text mining research in the field of finance.
Lu, Chen, Chen, Hung, and Li (2010) categorize finance related textual content into three categories: (a) forums, blogs, and wikis; (b) news and research reports; and (c) content generated by firms.
Extracting sentiment and other information from messages posted to stock message boards such as Yahoo!, Motley Fool, Silicon Investor, Raging Bull, etc., see Tumarkin and Whitelaw (2001), Antweiler and Frank (2004), Antweiler and Frank (2005), Das, Martinez-Jerez and Tufano (2005), Das and Chen (2007).
Other news sources: Lexis-Nexis, Factiva, Dow Jones News, etc., see Das, Martinez-Jerez and Tufano (2005); Boudoukh, Feldman, Kogan, Richardson (2012).
The Heard on the Street column in the Wall Street Journal has been used in work by Tetlock (2007), Tetlock, Saar-Tsechansky and Macskassay (2008); see also the use of Wall Street Journal articles by Lu, Chen, Chen, Hung, and Li (2010).
The Thomson Reuters NewsScope Sentiment Engine (RNSE), based on Infonics/Lexalytics algorithms, has been applied to varied stock data and text from internal databases; see Leinweber and Sisk (2011). Zhang and Skiena (2010) develop a market-neutral trading strategy using news media such as tweets, over 500 newspapers, Spinn3r RSS feeds, and LiveJournal.
Bollen, Mao, and Zeng (2010) claimed that stock direction of the Dow Jones Industrial Average can be predicted using tweets with 87.6% accuracy.
Bar-Haim, Dinur, Feldman, Fresko and Goldstein (2011) attempt to predict stock direction using tweets by detecting and overweighting the opinion of expert investors.
Brown (2012) looks at the correlation between tweets and the stock market via several measures.
Logunov (2011) uses OpinionFinder to generate many measures of sentiment from tweets.
Twitter based sentiment developed by Rao and Srivastava (2012) is found to be highly correlated with stock prices and indexes, as high as 0.88 for returns.
Sprenger and Welpe (2010) find that tweet bullishness is associated with abnormal stock returns and tweet volume predicts trading volume.
Zhang and Skiena (2010) use Twitter feeds and also three other sources of text: over 500 nationwide newspapers, RSS feeds from blogs, and LiveJournal blogs. These are used to compute two metrics.
\[ \begin{eqnarray*} \mbox{polarity} &=& \frac{n_{pos} - n_{neg}}{n_{pos} + n_{neg}} \\ \mbox{subjectivity} &=& \frac{n_{pos} + n_{neg}}{N} \end{eqnarray*} \]
where \(N\) is the total number of words in a text document, \(n_{pos}, n_{neg}\) are the number of positive and negative words, respectively.
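As a concrete illustration, both metrics can be computed from word counts as follows; the positive and negative word lists below are tiny stand-ins for the curated sentiment lexicons these studies actually use.

```r
# Polarity and subjectivity from counts of positive and negative words.
# These word lists are illustrative only, not a real sentiment lexicon.
pos_words = c("gain", "profit", "up", "bullish", "good")
neg_words = c("loss", "risk", "down", "bearish", "bad")

text_metrics = function(words) {
  n_pos = sum(words %in% pos_words)
  n_neg = sum(words %in% neg_words)
  N = length(words)
  c(polarity = (n_pos - n_neg)/(n_pos + n_neg),
    subjectivity = (n_pos + n_neg)/N)
}

words = c("the", "stock", "saw", "a", "gain", "then", "a", "loss", "then", "profit")
text_metrics(words)
```

Here polarity is \((2-1)/(2+1) = 0.33\) and subjectivity is \(3/10 = 0.3\).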
They find that the number of articles is predictive of trading volume.
Subjectivity is also predictive of trading volume, lending credence to the idea that differences of opinion make markets.
Stock return prediction is weak using polarity, but tweets do seem to have some predictive power.
Various sentiment driven market neutral strategies are shown to be profitable, though the study is not tested for robustness.
Logunov (2011) uses tweets data, applying OpinionFinder and also a newly developed classifier called Naive Emoticon Classification to encode sentiment. This is an unusual and original, albeit quite intuitive, use of emoticons to determine mood in text mining. If an emoticon exists, then the tweet is automatically coded with that sentiment or emotion. Four types of emoticons are considered: Happy (H), Sad (S), Joy (J), and Cry (C). Polarity is defined here as \[ \mbox{polarity} = A = \frac{n_H + n_J}{n_H + n_S + n_J + n_C} \] Values greater than 0.5 are positive. \(A\) stands for aggregate sentiment and appears to be strongly autocorrelated. Overall, prediction evidence is weak.
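A minimal R sketch of this idea follows; the emoticon-to-category mapping and the tweets below are illustrative assumptions, not taken from Logunov (2011):

```r
# Sketch of naive emoticon classification in the spirit of Logunov
# (2011); the emoticon-to-category mapping here is an assumption.
emoticon_polarity = function(tweets) {
  nH = sum(grepl(":)",  tweets, fixed = TRUE))  # Happy
  nJ = sum(grepl(":D",  tweets, fixed = TRUE))  # Joy
  nS = sum(grepl(":(",  tweets, fixed = TRUE))  # Sad
  nC = sum(grepl(":'(", tweets, fixed = TRUE))  # Cry
  (nH + nJ) / (nH + nS + nJ + nC)
}
tweets = c("great earnings :)", "market tanked :(", "love this stock :D")
emoticon_polarity(tweets)  # 2/3 > 0.5, so positive overall
```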
Text analysis is undertaken across companies in a cross-section.
The quality of text in company reports is much better than in message postings.
Textual analysis in this area has also resulted in technical improvements. Rudimentary approaches such as word count methods have been extended to weighted schemes, where weights are determined in statistical ways. In Das and Chen (2007), the discriminant score of each word across classification categories is used as a weighting index for the importance of words.
There is a proliferation of word-weighting schemes. A widely used one is ``inverse document frequency'' (\(idf\)) as a weighting coefficient. The \(idf\) for word \(j\) is
\[ w_j^{idf} = \ln \left( \frac{N}{df_j} \right) \] where \(N\) is the total number of documents, and \(df_j\) is the number of documents containing word \(j\). This scheme is described in Manning and Schutze (1999).
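As an illustration, the \(idf\) weights can be computed from a small made-up document-term matrix in R:

```r
# idf weights from a document-term matrix of raw counts (rows =
# documents, columns = words); the data here are made up.
dtm = matrix(c(1, 0, 2,
               0, 0, 1,
               3, 1, 0), nrow = 3, byrow = TRUE,
             dimnames = list(NULL, c("gain", "loss", "risk")))
N   = nrow(dtm)            # total number of documents
df  = colSums(dtm > 0)     # df_j: number of documents containing word j
idf = log(N / df)
print(idf)
# "loss" appears in only one document, so it gets the largest weight.
```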
Loughran and McDonald (2011) use this weighting approach to modify the word (term) frequency counts in the documents they analyze. The weight on word \(j\) in document \(i\) is specified as \[ w_{ij} = \max[0, 1 + \ln(f_{ij})] \; w_{j}^{idf} \] where \(f_{ij}\) is the frequency count of word \(j\) in document \(i\). This leads naturally to a document score of \[ S_i^{LM} = \frac{1}{1+\ln(a_i)} \sum_{j=1}^J w_{ij} \] Here \(a_i\) is the total number of words in document \(i\), and \(J\) is the total number of words in the lexicon. (The \(LM\) superscript signifies the weighting approach.)
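A sketch of this scoring for one document in R follows; the counts, \(idf\) weights, and document length are made up, and the \(\max\) is read as applying to the \(1+\ln(f_{ij})\) term:

```r
# Sketch of the LM weighted score for a single document; counts,
# idf weights, and document length are made up for illustration.
f_ij  = c(2, 1, 0)          # counts of three lexicon words in document i
w_idf = c(0.4, 1.1, 0.4)    # assumed idf weights w_j^idf
a_i   = 50                  # total number of words in document i
w_ij  = ifelse(f_ij > 0, pmax(0, 1 + log(f_ij)) * w_idf, 0)
S_i   = sum(w_ij) / (1 + log(a_i))
print(S_i)
```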
Whereas the \(idf\) approach is intuitive, it need not be relevant for market activity. An alternate and effective weighting scheme has been developed by Jegadeesh and Wu (2013, JW) using market movements. Words that occur more often on large market-move days are given a greater weight than other words. JW show that this scheme is superior to an unweighted one, and delivers an accurate system for determining the ``tone'' of a regulatory filing.
JW also conduct robustness checks that suggest that the approach is quite general, and applies to other domains with no additional modifications to the specification. Indeed, they find that tone extraction from 10-Ks may be used to predict IPO underpricing.
Jegadeesh and Wu (2013) create a ``global lexicon'' merging multiple word lists from the Harvard-IV-4 Psychological Dictionaries (Harvard Inquirer), the Lasswell Value Dictionary, the Loughran and McDonald lists, and the word list in Bradley and Lang (1999). They test this lexicon for robustness by checking (a) that the lexicon delivers accurate tone scores and (b) that it is complete, by discarding 50% of the words and seeing whether this causes a material change in results (it does not).
This approach provides a more reliable measure of document tone than preceding approaches. Their measure of filing tone is statistically related to filing-period returns after controlling for reasonable variables. Tone is significantly related to returns for up to two weeks after filing; the market appears to underreact to tone, and this is corrected within the two-week window.
The tone score of document \(i\) in the JW paper is specified as \[ S_i^{JW} = \frac{1}{a_i} \sum_{j=1}^J w_j f_{ij} \] where \(w_j\) is the weight for word \(j\) based on its relationship to market movement. (The \(JW\) superscript signifies the weighting approach.)
The following regression is used to determine the value of \(w_j\) (across all documents). \[ \begin{eqnarray*} r_i &=& a + b \cdot S_j^{JW} + \epsilon_i \\ &=& a + b \left( \frac{1}{a_i} \sum_{j=1}^J w_j f_{ij} \right) + \epsilon_i \\ &=& a + \left( \frac{1}{a_i} \sum_{j=1}^J (b w_j) f_{ij} \right) + \epsilon_i \\ &=& a + \left( \frac{1}{a_i} \sum_{j=1}^J B_j f_{ij} \right) + \epsilon_i \end{eqnarray*} \] where \(r_i\) is the abnormal return around the release of document \(i\), and \(B_j=b w_j\) is a modified word weight. This is then translated back into the original estimated word weight by normalization, i.e., \[ w_j = \frac{B_j - \frac{1}{J}\sum_{j=1}^J B_j}{\sigma(B_j)} \] where \(\sigma(B_j)\) is the standard deviation of \(B_j\) across all \(J\) words in the lexicon.
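The estimation step can be sketched in R on simulated data; the word counts, document lengths, returns, and ``true'' weights below are all artificial:

```r
# Sketch of the JW estimation step: regress abnormal returns on
# document-length-scaled word frequencies to recover B_j = b * w_j,
# then standardize. All data here are simulated.
set.seed(42)
n_docs = 200; J = 5
f = matrix(rpois(n_docs * J, 3), n_docs, J)        # word counts f_ij
a = rowSums(f) + 50                                # document lengths a_i
X = f / a                                          # scaled frequencies f_ij / a_i
w_true = c(2, -1, 0.5, 0, -2)                      # hypothetical true weights
r = drop(X %*% w_true) + rnorm(n_docs, sd = 0.01)  # simulated abnormal returns
B = coef(lm(r ~ X))[-1]                            # estimated B_j = b * w_j
w = (B - mean(B)) / sd(B)                          # normalized word weights
```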
Abnormal return \(r_i\) is defined as the three-day excess return over the CRSP value-weighted return. \[ r_i = \prod_{t=1}^{3} ret_{it} - \prod_{t=1}^{3} ret_{VW,t} \] Instead of \(r_i\) as the left-hand side variable in the regression, one might also use a binary variable for good and bad news, positive or negative 10-Ks, etc., and instead of the regression we would use a limited dependent variable structure such as logit, probit, or even a Bayes classifier. However, the advantages of \(r_i\) being a continuous variable are considerable, since it offers a range of outcomes and a simpler regression fit.
JW use data from 10-K filings over the period 1995–2010 extracted from SEC’s EDGAR database. They ignore positive and negative words when a negator occurs within a distance of three words, the negators being the words ``not, no, never’’.
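The negation rule can be sketched as follows in R; the helper function and the example sentence are illustrative:

```r
# Sketch of the negation rule: a sentiment word is ignored if a negator
# ("not", "no", "never") appears within three words before it.
negated = function(words, idx, negators = c("not", "no", "never")) {
  if (idx == 1) return(FALSE)
  window = words[max(1, idx - 3):(idx - 1)]  # up to three preceding words
  any(window %in% negators)
}
words = c("the", "firm", "is", "not", "a", "good", "investment")
negated(words, which(words == "good"))  # TRUE: "good" is negated by "not"
```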
Word weight scores are computed for the entire sample, and also for three roughly equal concatenated subperiods. The correlation of word weights across these subperiods is high, around 0.50 on average. Hence, the word weights appear to be quite stable over time and different economic regimes. As would be expected, when two subperiods are used the correlation of word weights is higher, suggesting that longer samples deliver better weighting scores. Interestingly, the correlation of JW scores with LM \(idf\) scores is low, and therefore, they are not substitutes.
JW examine the market variables that determine document score \(S_i^{JW}\) for each 10-K, with right-hand side variables comprising firm size, book-to-market, volatility, turnover, the three-day excess return over CRSP VW around earnings announcements, and accruals. Both positive and negative tone are significantly related to size and BM, suggesting that risk factors are captured in the score.
Volatility is also significant and has the correct sign, i.e., that increases in volatility make negative tone more negative and positive tone less positive.
The same holds for turnover, in that more turnover makes tone pessimistic. The greater the earnings announcement abnormal return, the higher the tone, though this is not significant. Accruals do not significantly relate to score.
When regressing filing period return on document score and other controls (same as in the previous paragraph), the score is always statistically significant. Hence text in the 10-Ks does correlate with the market’s view of the firm after incorporating the information in the 10-K and from other sources.
Finally, JW find a negative relation between tone and IPO underpricing, suggesting that term weights from one domain can be reliably used in a different domain.
When using company filings, an important issue is whether to use the entire text of the filing or only portions of it. Sharper conclusions may be possible from specific sections of a filing such as the 10-K. Loughran and McDonald (2011) examined whether the Management Discussion and Analysis (MD&A) section of the filing was better at providing tone (sentiment) than the entire 10-K; they found it was not.
They also showed that using their six tailor-made word lists gave better results for detecting tone than did the Harvard Inquirer words. And as discussed earlier, proper word-weighting also improves tone detection. Their word lists also worked well in detecting tone for seasoned equity offerings and news articles, providing good correlation with returns.
Loughran and McDonald (2014) examine the readability of financial documents by examining the text in 10-K filings. They compute the Fog index for these documents and compare it to post-filing measures of the information environment, such as the volatility of returns and the dispersion of analysts' recommendations. If the text is readable, there should be less dispersion in the information environment, i.e., lower volatility and lower dispersion of analysts' expectations around the release of the 10-K.
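For concreteness, the (Gunning) Fog index is \(0.4 \times (\mbox{average sentence length} + \mbox{percent complex words})\), where complex words have three or more syllables. It can be approximated in R; the syllable counter below is a crude vowel-cluster heuristic, so treat the numbers as approximate:

```r
# A rough Fog index calculation; the syllable counter is a crude
# vowel-cluster heuristic, so the result is approximate.
fog_index = function(text) {
  sentences = unlist(strsplit(text, "[.!?]+"))
  sentences = sentences[grepl("[[:alpha:]]", sentences)]
  words = unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  words = words[nchar(words) > 0]
  # approximate syllables as runs of vowels (gregexpr returns -1 if none)
  syllables = sapply(gregexpr("[aeiouy]+", words), function(m) sum(m > 0))
  pct_complex = mean(syllables >= 3) * 100
  0.4 * (length(words) / length(sentences) + pct_complex)
}
fog_index("The firm performed well. Profitability improved substantially.")
```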
Whereas they find that the Fog index does not seem to correlate well with these measures of the information environment, the file size of the 10-K is a much better measure and is significantly related to return volatility, earnings forecast errors, and earnings forecast dispersion, after accounting for control variates such as size, book-to-market, lagged volatility, lagged return, and industry effects.
Li (2008) also shows that 10-Ks with high Fog index and longer length have lower subsequent earnings. Thus managers with poor performance may try to hide this by increasing the complexity of their documents, mostly by increasing the size of their filings.
The readability of business documents has caught the attention of many researchers, not unexpectedly in the accounting area. DeFranco et al (2013) combine the Fog, Flesch-Kincaid, and Flesch scores to show that higher readability of analysts' reports is related to higher trading volume, suggesting that a better information environment induces people to trade more rather than shy away from the market.
Lehavy et al (2011) show that a greater Fog index on 10-Ks is correlated with greater analyst following, more analyst dispersion, and lower accuracy of their forecasts. Most of the literature focuses on 10-Ks because these are deemed the most informative to investors, but it would be interesting to see whether readability differs for shorter documents such as 10-Qs. Whether the simple, dominant (albeit language-independent) measure of file size remains a strong indicator of readability in documents other than 10-Ks also remains to be seen.
Another examination of 10-K text appears in Bodnaruk et al (2013). Here, the authors measure the percentage of negative words in 10-Ks to see if this is an indicator of financial constraints that improves on existing measures. There is low correlation of this measure with size, where bigger firms are widely posited to be less financially constrained. But, an increase in the percentage of negative words suggests an inflection point indicating the tendency of a firm to lapse into a state of financial constraint. Using control variables such as market capitalization, prior returns, and a negative earnings indicator, percentage negative words helps more in identifying which firm will be financially constrained than widely used constraint indexes. The negative word count is useful in that it is independent of the way in which the filing is written, and picks up cues from managers who tend to use more negative words.
The number of negative words is useful in predicting liquidity events such as dividend cuts or omissions, downgrades, and asset growth. A one standard deviation increase in negative words increases the likelihood of a dividend omission by 8.9% and a debt downgrade by 10.8%. An obvious extension of this work would be to see whether default probability models may be enhanced by using the percentage of negative words as an explanatory variable.
Sprenger (2011) integrates data from text classification of tweets, user voting, and a proprietary stock game to extract the bullishness of online investors; these ideas are behind the site http://TweetTrader.net.
Tweets also pose interesting problems of big streaming data discussed in Pervin, Fang, Datta, and Dutta (2013).
Data used here is from filings such as 10-Ks, etc., (Loughran and McDonald (2011); Burdick et al (2011); Bodnaruk, Loughran, and McDonald (2013); Jegadeesh and Wu (2013); Loughran and McDonald (2014)).
Wysocki (1999) found that for the 50 top firms in message posting volume on Yahoo! Finance, message volume predicted next day abnormal stock returns. Using a broader set of firms, he also found that high message volume firms were those with inflated valuations (relative to fundamentals), high trading volume, high short seller activity (given possibly inflated valuations), high analyst following (message posting appears to be related to news as well, correlated with a general notion of “attention” stocks), and low institutional holdings (hence broader investor discussion and interest), all intuitive outcomes.
Bagnoli, Beneish, and Watts (1999) examined earnings ``whispers'', unofficial crowd-sourced forecasts of quarterly earnings from small investors, and found them to be more accurate than First Call analyst forecasts.
Tumarkin and Whitelaw (2001) examined self-reported sentiment on the Raging Bull message board and found no predictive content, either of returns or volume.
Antweiler and Frank (2004) used the Naive Bayes algorithm for classification, implemented in the {Rainbow} package of Andrew McCallum (1996). They also repeated the analysis using Support Vector Machines (SVMs) as a robustness check. Both algorithms generate similar empirical results. Once the algorithm is trained, they use it out-of-sample to sign each message as \(\{Buy, Hold, Sell\}\). Let \(n_B, n_S\) be the number of buy and sell messages, respectively. Then \(R = n_B/n_S\) is the ratio of buy to sell messages. Based on this they define their bullishness index
\[ B = \frac{n_B - n_S}{n_B + n_S} = \frac{R-1}{R+1} \in (-1,+1) \]
This metric is independent of the number of messages, i.e., it is homogeneous of degree zero in \(n_B, n_S\). An alternative measure is also proposed, i.e.,
\[ \begin{eqnarray*} B^* &=& \ln\left[\frac{1+n_B}{1+n_S} \right] \\ &=& \ln\left[\frac{1+R(1+n_B+n_S)}{1+R+n_B+n_S} \right] \\ &=& \ln\left[\frac{2+(n_B+n_S)(1+B)}{2+(n_B+n_S)(1-B)} \right] \\ & \approx & B \cdot \ln(1+n_B+n_S) \end{eqnarray*} \]
This measure takes the bullishness index \(B\) and weights it by the number of messages in both categories. It is homogeneous of degree between zero and one. They also propose a third measure, which is much more direct, i.e.,
\[ B^{**} = n_B - n_S = (n_B+n_S) \cdot \frac{R-1}{R+1} = M \cdot B \]
which is homogeneous of degree one, and is a message-weighted bullishness index. They prefer to use \(B^*\) in their algorithms as it appears to deliver the best predictive results. Finally, they propose an agreement index,
\[ A = 1 - \sqrt{1-B^2} \in (0,1) \]
Note how closely this is related to the disagreement index seen earlier.
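All four measures are easy to compute; here is a small R sketch with made-up message counts:

```r
# The bullishness measures B, B*, B** and the agreement index A,
# computed for made-up buy/sell message counts.
bullishness = function(nB, nS) {
  B     = (nB - nS) / (nB + nS)
  Bstar = log((1 + nB) / (1 + nS))  # B*, weighted by message volume
  Bss   = nB - nS                   # B**, homogeneous of degree one
  A     = 1 - sqrt(1 - B^2)         # agreement index
  c(B = B, Bstar = Bstar, Bss = Bss, A = A)
}
bullishness(nB = 60, nS = 40)
```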
The bullishness index does not predict returns, but returns do explain message posting. More messages are posted in periods of negative returns, but this is not a significant relationship.
A contemporaneous relation between returns and bullishness is present. Overall, Antweiler and Frank (2004) (AF04) present some important results that are indicative of the results in this literature, confirmed also in subsequent work.
First, that message board postings do not predict returns.
Second, that disagreement (measured from postings) induces trading.
Third, message posting does predict volatility at daily frequencies and intraday.
Fourth, messages reflect public information rapidly. Overall, they conclude that stock chat is meaningful in content and not just noise.
An illustrative list of applications for finance firms is as follows:
Latent Semantic Analysis (LSA) is an approach for reducing the dimension of the Term-Document Matrix (TDM), or the corresponding Document-Term Matrix (DTM); the two are generally used interchangeably unless one is specifically invoked. Dimension reduction of the TDM offers two benefits:
The DTM is usually a sparse matrix, and sparseness means that our algorithms have to work harder on missing data, which is clearly wasteful. Some of this sparseness is attenuated by applying LSA to the TDM.
The problem of synonymy also exists in the TDM, which usually contains thousands of terms (words). Synonymy arises because many words have similar meanings, i.e., redundancy exists in the list of terms. LSA mitigates this redundancy, as we shall see through the ensuing analysis.
While not precisely the same thing, think of LSA in the text domain as analogous to PCA in the data domain.
LSA is the application of Singular Value Decomposition (SVD) to the TDM, extracted from a text corpus. Define the TDM to be a matrix \(M \in {\cal R}^{m \times n}\), where \(m\) is the number of terms and \(n\) is the number of documents.
The SVD of matrix \(M\) is given by \[ M = T \cdot S \cdot D^\top \] where \(T \in {\cal R}^{m \times n}\) and \(D \in {\cal R}^{n \times n}\) are orthonormal to each other, and \(S \in {\cal R}^{n \times n}\) is the “singular values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the latent dimensions in the TDM.
Create a temporary directory and add some documents to it. This is a modification of the example in the lsa package.
system("mkdir D")
write( c("blue", "red", "green"), file=paste("D", "D1.txt", sep="/"))
write( c("black", "blue", "red"), file=paste("D", "D2.txt", sep="/"))
write( c("yellow", "black", "green"), file=paste("D", "D3.txt", sep="/"))
write( c("yellow", "red", "black"), file=paste("D", "D4.txt", sep="/"))
Create a TDM using the textmatrix function.
library(lsa)
tdm = textmatrix("D",minWordLength=1)
print(tdm)## docs
## terms D1.txt D2.txt D3.txt D4.txt
## blue 1 1 0 0
## green 1 0 1 0
## red 1 1 0 1
## black 0 1 1 1
## yellow 0 0 1 1
Remove the extra directory.
system("rm -rf D")
SVD tries to connect the correlation matrix of terms (\(M \cdot M^\top\)) with the correlation matrix of documents (\(M^\top \cdot M\)) through the singular matrix.
To see this connection, note that matrix \(T\) contains the eigenvectors of the correlation matrix of terms. Likewise, the matrix \(D\) contains the eigenvectors of the correlation matrix of documents. To see this, let’s compute
et = eigen(tdm %*% t(tdm))$vectors
print(et)## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.3629044 -6.015010e-01 -0.06829369 3.717480e-01 0.6030227
## [2,] -0.3328695 -2.220446e-16 -0.89347008 5.551115e-16 -0.3015113
## [3,] -0.5593741 -3.717480e-01 0.31014767 -6.015010e-01 -0.3015113
## [4,] -0.5593741 3.717480e-01 0.31014767 6.015010e-01 -0.3015113
## [5,] -0.3629044 6.015010e-01 -0.06829369 -3.717480e-01 0.6030227
ed = eigen(t(tdm) %*% tdm)$vectors
print(ed)## [,1] [,2] [,3] [,4]
## [1,] -0.4570561 0.601501 -0.5395366 -0.371748
## [2,] -0.5395366 0.371748 0.4570561 0.601501
## [3,] -0.4570561 -0.601501 -0.5395366 0.371748
## [4,] -0.5395366 -0.371748 0.4570561 -0.601501
If we wish to reduce the dimension of the latent semantic space to \(k < n\) then we use only the first \(k\) eigenvectors. The lsa function does this automatically.
We call LSA and ask it to automatically reduce the dimension of the TDM using a built-in function dimcalc_share.
res = lsa(tdm,dims=dimcalc_share())
print(res)## $tk
## [,1] [,2]
## blue -0.3629044 -6.015010e-01
## green -0.3328695 -5.551115e-17
## red -0.5593741 -3.717480e-01
## black -0.5593741 3.717480e-01
## yellow -0.3629044 6.015010e-01
##
## $dk
## [,1] [,2]
## D1.txt -0.4570561 -0.601501
## D2.txt -0.5395366 -0.371748
## D3.txt -0.4570561 0.601501
## D4.txt -0.5395366 0.371748
##
## $sk
## [1] 2.746158 1.618034
##
## attr(,"class")
## [1] "LSAspace"
We can see that the dimension has been reduced from \(n=4\) to \(k=2\). The output is shown for both the term matrix and the document matrix, both of which have only two columns. Think of these as the two “principal semantic components” of the TDM.
Compare the output of the LSA to the eigenvectors above to see that they are identical. The singular values in the output are connected to the SVD as follows.
First of all, we see that the lsa function is essentially a wrapper around the svd function in base R.
res2 = svd(tdm)
print(res2)## $d
## [1] 2.746158 1.618034 1.207733 0.618034
##
## $u
## [,1] [,2] [,3] [,4]
## [1,] -0.3629044 -6.015010e-01 0.06829369 3.717480e-01
## [2,] -0.3328695 -5.551115e-17 0.89347008 -3.455569e-15
## [3,] -0.5593741 -3.717480e-01 -0.31014767 -6.015010e-01
## [4,] -0.5593741 3.717480e-01 -0.31014767 6.015010e-01
## [5,] -0.3629044 6.015010e-01 0.06829369 -3.717480e-01
##
## $v
## [,1] [,2] [,3] [,4]
## [1,] -0.4570561 -0.601501 0.5395366 -0.371748
## [2,] -0.5395366 -0.371748 -0.4570561 0.601501
## [3,] -0.4570561 0.601501 0.5395366 0.371748
## [4,] -0.5395366 0.371748 -0.4570561 -0.601501
The output here is the same as that of LSA except it is provided for \(n=4\). So we have four columns in \(T\) and \(D\) rather than two. Compare the results here to the previous two slides to see the connection.
We may reconstruct the TDM using the result of the LSA.
tdm_lsa = res$tk %*% diag(res$sk) %*% t(res$dk)
print(tdm_lsa)## D1.txt D2.txt D3.txt D4.txt
## blue 1.0409089 0.8995016 -0.1299115 0.1758948
## green 0.4178005 0.4931970 0.4178005 0.4931970
## red 1.0639006 1.0524048 0.3402938 0.6051912
## black 0.3402938 0.6051912 1.0639006 1.0524048
## yellow -0.1299115 0.1758948 1.0409089 0.8995016
We see that the new TDM after the LSA operation has non-integer frequency counts, but it may be treated in the same way as the original TDM. The document vectors now populate a slightly different hyperspace.
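One common use of the reduced space is measuring document similarity; a small cosine-similarity sketch in R follows, where the two vectors are the D1 and D2 columns of the reconstructed TDM above, rounded to two decimals:

```r
# Cosine similarity between two document vectors, e.g., columns of the
# reconstructed TDM after LSA (values rounded for illustration).
cosine_sim = function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
d1 = c(1.04, 0.42, 1.06, 0.34, -0.13)  # D1.txt column, rounded
d2 = c(0.90, 0.49, 1.05, 0.61,  0.18)  # D2.txt column, rounded
cosine_sim(d1, d2)  # close to 1: D1 and D2 remain similar after LSA
```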
LSA reduces the rank of the correlation matrix of terms \(M \cdot M^\top\) to \(k=2\). Here we see the rank before and after LSA.
library(Matrix)## Warning: package 'Matrix' was built under R version 3.2.5
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:qdap':
##
## %&%
print(rankMatrix(tdm))## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15
print(rankMatrix(tdm_lsa))## [1] 2
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15
Latent Dirichlet Allocation (LDA) is similar to LSA, in that it seeks to find the most related words and cluster them into topics. It uses a Bayesian approach to do this, but more on that later. Here, let’s just do an example to see how we might use the topicmodels package.
#Load the package
library(topicmodels)## Warning: package 'topicmodels' was built under R version 3.2.5
#Load data on news articles from Associated Press
data(AssociatedPress)
print(dim(AssociatedPress))## [1] 2246 10473
This is a large DTM (not TDM). It has more than 10,000 terms, and more than 2,000 documents. This is very large and LDA will take some time, so let’s run it on a subset of the documents.
dtm = AssociatedPress[1:100,]
dim(dtm)## [1] 100 10473
#Set parameters for Gibbs sampling
burnin = 4000
iter = 2000
thin = 500
seed = list(2003,5,63,100001,765)
nstart = 5
best = TRUE
#Number of topics
k = 5
#Run LDA
res <-LDA(dtm, k, method="Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))
#Show topics
res.topics = as.matrix(topics(res))
print(res.topics)## [,1]
## [1,] 5
## [2,] 4
## [3,] 5
## [4,] 1
## [5,] 1
## [6,] 4
## [7,] 2
## [8,] 1
## [9,] 5
## [10,] 5
## [11,] 5
## [12,] 3
## [13,] 1
## [14,] 4
## [15,] 2
## [16,] 3
## [17,] 1
## [18,] 1
## [19,] 2
## [20,] 3
## [21,] 5
## [22,] 2
## [23,] 2
## [24,] 1
## [25,] 2
## [26,] 4
## [27,] 4
## [28,] 2
## [29,] 4
## [30,] 3
## [31,] 2
## [32,] 1
## [33,] 4
## [34,] 1
## [35,] 5
## [36,] 4
## [37,] 1
## [38,] 4
## [39,] 4
## [40,] 2
## [41,] 2
## [42,] 2
## [43,] 1
## [44,] 1
## [45,] 5
## [46,] 3
## [47,] 2
## [48,] 3
## [49,] 1
## [50,] 4
## [51,] 1
## [52,] 2
## [53,] 3
## [54,] 1
## [55,] 3
## [56,] 4
## [57,] 4
## [58,] 2
## [59,] 5
## [60,] 2
## [61,] 2
## [62,] 3
## [63,] 2
## [64,] 1
## [65,] 2
## [66,] 4
## [67,] 5
## [68,] 2
## [69,] 4
## [70,] 5
## [71,] 5
## [72,] 5
## [73,] 2
## [74,] 5
## [75,] 2
## [76,] 1
## [77,] 1
## [78,] 1
## [79,] 3
## [80,] 5
## [81,] 1
## [82,] 3
## [83,] 5
## [84,] 3
## [85,] 3
## [86,] 5
## [87,] 2
## [88,] 5
## [89,] 2
## [90,] 5
## [91,] 3
## [92,] 1
## [93,] 1
## [94,] 4
## [95,] 3
## [96,] 4
## [97,] 4
## [98,] 4
## [99,] 5
## [100,] 5
#Show top terms
res.terms = as.matrix(terms(res,10))
print(res.terms)## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "i" "percent" "new" "soviet" "police"
## [2,] "people" "year" "york" "government" "central"
## [3,] "state" "company" "expected" "official" "man"
## [4,] "years" "last" "states" "two" "monday"
## [5,] "bush" "new" "officials" "union" "friday"
## [6,] "president" "bank" "program" "officials" "city"
## [7,] "get" "oil" "california" "war" "four"
## [8,] "told" "prices" "week" "president" "school"
## [9,] "administration" "report" "air" "world" "high"
## [10,] "dukakis" "million" "help" "leaders" "national"
#Show topic probabilities
res.topicProbs = as.data.frame(res@gamma)
print(res.topicProbs)## V1 V2 V3 V4 V5
## 1 0.19169329 0.06070288 0.04472843 0.10223642 0.60063898
## 2 0.12149533 0.14330218 0.08099688 0.58255452 0.07165109
## 3 0.27213115 0.04262295 0.05901639 0.07868852 0.54754098
## 4 0.29571984 0.16731518 0.19844358 0.19455253 0.14396887
## 5 0.31896552 0.15517241 0.20689655 0.14655172 0.17241379
## 6 0.30360934 0.08492569 0.08492569 0.46284501 0.06369427
## 7 0.17050691 0.40092166 0.15668203 0.17050691 0.10138249
## 8 0.37142857 0.15238095 0.14285714 0.20000000 0.13333333
## 9 0.19298246 0.17543860 0.19298246 0.19298246 0.24561404
## 10 0.19879518 0.16265060 0.17469880 0.18674699 0.27710843
## 11 0.21212121 0.20202020 0.16161616 0.15151515 0.27272727
## 12 0.20143885 0.15827338 0.25899281 0.17985612 0.20143885
## 13 0.41395349 0.16279070 0.18139535 0.12558140 0.11627907
## 14 0.17948718 0.17948718 0.12820513 0.30769231 0.20512821
## 15 0.05135952 0.78247734 0.06344411 0.06042296 0.04229607
## 16 0.09770115 0.24712644 0.35632184 0.14942529 0.14942529
## 17 0.43103448 0.18103448 0.09051724 0.10775862 0.18965517
## 18 0.67857143 0.04591837 0.06377551 0.08418367 0.12755102
## 19 0.07083333 0.70000000 0.08750000 0.07500000 0.06666667
## 20 0.15196078 0.05637255 0.69117647 0.04656863 0.05392157
## 21 0.21782178 0.11881188 0.12871287 0.15841584 0.37623762
## 22 0.16666667 0.30000000 0.16666667 0.16666667 0.20000000
## 23 0.19298246 0.21052632 0.17543860 0.21052632 0.21052632
## 24 0.31775701 0.20560748 0.16822430 0.18691589 0.12149533
## 25 0.05121951 0.65121951 0.15365854 0.08536585 0.05853659
## 26 0.11740891 0.09311741 0.08502024 0.37246964 0.33198381
## 27 0.06583072 0.05956113 0.10658307 0.68338558 0.08463950
## 28 0.15068493 0.30136986 0.12328767 0.26027397 0.16438356
## 29 0.07860262 0.04148472 0.05676856 0.68995633 0.13318777
## 30 0.13968254 0.17142857 0.46031746 0.07936508 0.14920635
## 31 0.08405172 0.74784483 0.07112069 0.05172414 0.04525862
## 32 0.66137566 0.10846561 0.06349206 0.07407407 0.09259259
## 33 0.14655172 0.18103448 0.15517241 0.41379310 0.10344828
## 34 0.29605263 0.19736842 0.21052632 0.13157895 0.16447368
## 35 0.08080808 0.05050505 0.10437710 0.07070707 0.69360269
## 36 0.13333333 0.07878788 0.08484848 0.46666667 0.23636364
## 37 0.46202532 0.08227848 0.12974684 0.16139241 0.16455696
## 38 0.09442060 0.07296137 0.12017167 0.64377682 0.06866953
## 39 0.11764706 0.08359133 0.10526316 0.62538700 0.06811146
## 40 0.10869565 0.56521739 0.14492754 0.07246377 0.10869565
## 41 0.07671958 0.43650794 0.16137566 0.25396825 0.07142857
## 42 0.11445783 0.57831325 0.11445783 0.09036145 0.10240964
## 43 0.55793991 0.10944206 0.08798283 0.09442060 0.15021459
## 44 0.40939597 0.10067114 0.22818792 0.12751678 0.13422819
## 45 0.20000000 0.15121951 0.12682927 0.25853659 0.26341463
## 46 0.14828897 0.11406844 0.56653992 0.08365019 0.08745247
## 47 0.09929078 0.41134752 0.13475177 0.22695035 0.12765957
## 48 0.20129870 0.07467532 0.54870130 0.10714286 0.06818182
## 49 0.46800000 0.09600000 0.18400000 0.10400000 0.14800000
## 50 0.22955145 0.08179420 0.05013193 0.60158311 0.03693931
## 51 0.28368794 0.17730496 0.18439716 0.14893617 0.20567376
## 52 0.12977099 0.45801527 0.12977099 0.18320611 0.09923664
## 53 0.10507246 0.14492754 0.55072464 0.06884058 0.13043478
## 54 0.42647059 0.13725490 0.15196078 0.15686275 0.12745098
## 55 0.11881188 0.19801980 0.44554455 0.08910891 0.14851485
## 56 0.22857143 0.15714286 0.13571429 0.37142857 0.10714286
## 57 0.15294118 0.07058824 0.06117647 0.66823529 0.04705882
## 58 0.11494253 0.49425287 0.14367816 0.12068966 0.12643678
## 59 0.13278008 0.04979253 0.13692946 0.26556017 0.41493776
## 60 0.16666667 0.31666667 0.16666667 0.16666667 0.18333333
## 61 0.06796117 0.73786408 0.08090615 0.04854369 0.06472492
## 62 0.12680115 0.12968300 0.58213256 0.12103746 0.04034582
## 63 0.07902736 0.72948328 0.09118541 0.05471125 0.04559271
## 64 0.44285714 0.12142857 0.14285714 0.13214286 0.16071429
## 65 0.19540230 0.31034483 0.19540230 0.14942529 0.14942529
## 66 0.18518519 0.22222222 0.17037037 0.28888889 0.13333333
## 67 0.07024793 0.07851240 0.08677686 0.04545455 0.71900826
## 68 0.10181818 0.48000000 0.14909091 0.12727273 0.14181818
## 69 0.12307692 0.15384615 0.10000000 0.43076923 0.19230769
## 70 0.12745098 0.07352941 0.14215686 0.13235294 0.52450980
## 71 0.21582734 0.10791367 0.16546763 0.14388489 0.36690647
## 72 0.17560976 0.11219512 0.17073171 0.15609756 0.38536585
## 73 0.12280702 0.46198830 0.07602339 0.23976608 0.09941520
## 74 0.20535714 0.16964286 0.17857143 0.14285714 0.30357143
## 75 0.07567568 0.47027027 0.11891892 0.19459459 0.14054054
## 76 0.67310789 0.15619968 0.07407407 0.05152979 0.04508857
## 77 0.63834423 0.07189542 0.09150327 0.11546841 0.08278867
## 78 0.61504425 0.09292035 0.11946903 0.11504425 0.05752212
## 79 0.10971787 0.07523511 0.65830721 0.07210031 0.08463950
## 80 0.11111111 0.08666667 0.11111111 0.05777778 0.63333333
## 81 0.49681529 0.03821656 0.15286624 0.14437367 0.16772824
## 82 0.20111732 0.17318436 0.24022346 0.15642458 0.22905028
## 83 0.10731707 0.15609756 0.11219512 0.23902439 0.38536585
## 84 0.26016260 0.10569106 0.36585366 0.13008130 0.13821138
## 85 0.11525424 0.10508475 0.39322034 0.30508475 0.08135593
## 86 0.15454545 0.06060606 0.15757576 0.09696970 0.53030303
## 87 0.08301887 0.67924528 0.07924528 0.09433962 0.06415094
## 88 0.16666667 0.15972222 0.22916667 0.11805556 0.32638889
## 89 0.12389381 0.47787611 0.09734513 0.14159292 0.15929204
## 90 0.12389381 0.11061947 0.23008850 0.10176991 0.43362832
## 91 0.19724771 0.11009174 0.30275229 0.16972477 0.22018349
## 92 0.33854167 0.13541667 0.12500000 0.11458333 0.28645833
## 93 0.40131579 0.13815789 0.10526316 0.18421053 0.17105263
## 94 0.06930693 0.10231023 0.09240924 0.67656766 0.05940594
## 95 0.09130435 0.15000000 0.65434783 0.03043478 0.07391304
## 96 0.13370474 0.13091922 0.12256267 0.49303621 0.11977716
## 97 0.06709265 0.06070288 0.11501597 0.60383387 0.15335463
## 98 0.16438356 0.16438356 0.17808219 0.28767123 0.20547945
## 99 0.06274510 0.08235294 0.16470588 0.06666667 0.62352941
## 100 0.11627907 0.20465116 0.11162791 0.16744186 0.40000000
#Check that each document's topic probabilities sum to one
print(rowSums(res.topicProbs))## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Note that the highest probability in each row assigns each document to a topic.
Latent Dirichlet Allocation (LDA) was introduced by David Blei, Andrew Ng, and Michael Jordan in 2003; see their paper titled “Latent Dirichlet Allocation,” Journal of Machine Learning Research 3, pp. 993–1022.
The simplest way to think about LDA is as a probability model that connects documents with words and topics. The components are:
Next, we connect the above objects to \(K\) topics, indexed by \(l\), i.e., \(t_l\). We will see that LDA is encapsulated in two matrices: Matrix \(A\) and Matrix \(B\).
\[ p(\theta | \alpha) = \frac{\Gamma(\sum_{l=1}^K \alpha_l)}{\prod_{l=1}^K \Gamma(\alpha_l)} \; \prod_{l=1}^K \theta_l^{\alpha_l - 1} \]
where \(\Gamma(\cdot)\) is the Gamma function.
- LDA thus gets its name from the use of the Dirichlet distribution, embodied in Matrix \(A\). Since the topics are latent, this explains the rest of the nomenclature.
- Given \(\theta\), we sample topics from matrix \(A\) with probability \(p(t | \theta)\).
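A Dirichlet draw can be simulated by normalizing independent Gamma variates; a minimal base-R sketch (the helper name rdirichlet1 is ours; packages such as MCMCpack provide equivalent functions):

```r
#Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_l, 1) draws
rdirichlet1 = function(alpha) {
  g = rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}
set.seed(1)
theta = rdirichlet1(rep(1, 5))  #symmetric prior over K = 5 topics
print(sum(theta))  #1
```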
\[ p(\theta, {\bf t}, {\bf w}) = p(\theta | \alpha) \prod_{l=1}^K p(t_l | \theta) p(w_l | t_l) \]
\[ p({\bf w}) = \int p(\theta | \alpha) \left(\prod_{l=1}^K \sum_{t_l} p(t_l | \theta) p(w_l | t_l)\; \right) d\theta \]
\[ p(D) = \prod_{j=1}^M \int p(\theta_j | \alpha) \left(\prod_{l=1}^K \sum_{t_{jl}} p(t_l | \theta_j) p(w_l | t_l)\; \right) d\theta_j \]
The goal is to maximize this likelihood by picking the vector \(\alpha\) and the probabilities in the matrix \(B\). (Note that were a Dirichlet distribution not used, then we could directly pick values in Matrices \(A\) and \(B\).)
The computation is undertaken using MCMC with Gibbs sampling as shown in the example earlier.
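To make the Gibbs sampling step concrete, here is a toy collapsed Gibbs sampler for LDA in base R. The corpus, hyperparameters, and variable names are all invented for illustration; production samplers (as in the topicmodels or text2vec packages) add many refinements.

```r
set.seed(42)
docs = list(c(1, 2, 1, 3), c(3, 4, 4, 5), c(1, 2, 5))  #word ids per document
V = 5; K = 2; alpha = 0.5; beta = 0.1
#z holds the current topic of every token; ndk and nkw track assignment counts
z = lapply(docs, function(d) sample(1:K, length(d), replace = TRUE))
ndk = matrix(0, length(docs), K)  #document x topic counts
nkw = matrix(0, K, V)             #topic x word counts
for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
  ndk[d, z[[d]][i]] = ndk[d, z[[d]][i]] + 1
  nkw[z[[d]][i], docs[[d]][i]] = nkw[z[[d]][i], docs[[d]][i]] + 1
}
for (iter in 1:50) {
  for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
    w = docs[[d]][i]; k = z[[d]][i]
    ndk[d, k] = ndk[d, k] - 1; nkw[k, w] = nkw[k, w] - 1  #remove this token
    p = (ndk[d, ] + alpha) * (nkw[, w] + beta) / (rowSums(nkw) + V * beta)
    k = sample(1:K, 1, prob = p)  #resample its topic from the conditional
    z[[d]][i] = k
    ndk[d, k] = ndk[d, k] + 1; nkw[k, w] = nkw[k, w] + 1
  }
}
theta_hat = (ndk + alpha) / rowSums(ndk + alpha)  #document-topic estimates
print(rowSums(theta_hat))  #each row sums to 1
```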
See the original vignette from which this is abstracted. https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html
library(text2vec)
## Warning: package 'text2vec' was built under R version 3.2.5
##
## Attaching package: 'text2vec'
## The following object is masked from 'package:qdap':
##
## %>%
library(data.table)
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, last
## The following object is masked from 'package:qdapTools':
##
## shift
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
all_ids = movie_review$id
train_ids = sample(all_ids, 4000)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[J(train_ids)]
test = movie_review[J(test_ids)]
print(head(train))
## id sentiment
## 1: 11912_2 0
## 2: 11507_10 1
## 3: 8194_9 1
## 4: 11426_10 1
## 5: 4043_3 0
## 6: 11287_3 0
## review
## 1: The story behind this movie is very interesting, and in general the plot is not so bad... but the details: writing, directing, continuity, pacing, action sequences, stunts, and use of CG all cheapen and spoil the film.<br /><br />First off, action sequences. They are all quite unexciting. Most consist of someone standing up and getting shot, making no attempt to run, fight, dodge, or whatever, even though they have all the time in the world. The sequences just seem bland for something made in 2004.<br /><br />The CG features very nicely rendered and animated effects, but they come off looking cheap because of how they are used.<br /><br />Pacing: everything happens too quickly. For example, \\"Elle\\" is trained to fight in a couple of hours, and from the start can do back-flips, etc. Why is she so acrobatic? None of this is explained in the movie. As Lilith, she wouldn't have needed to be able to do back flips - maybe she couldn't, since she had wings.<br /><br />Also, we have sequences like a woman getting run over by a car, and getting up and just wandering off into a deserted room with a sink and mirror, and then stabbing herself in the throat, all for no apparent reason, and without any of the spectators really caring that she just got hit by a car (and then felt the secondary effects of another, exploding car)... \\"Are you okay?\\" asks the driver \\"yes, I'm fine\\" she says, bloodied and disheveled.<br /><br />I watched it all, though, because the introduction promised me that it would be interesting... but in the end, the poor execution made me wish for anything else: Blade, Vampire Hunter D, even that movie with vampires where Jackie Chan was comic relief, because they managed to suspend my disbelief, but this just made me want to shake the director awake, and give the writer a good talking to.
## 2: I remember the original series vividly mostly due to it's unique blend of wry humor and macabre subject matter. Kolchak was hard-bitten newsman from the Ben Hecht school of big-city reporting, and his gritty determination and wise-ass demeanor made even the most mundane episode eminently watchable. My personal fave was \\"The Spanish Moss Murders\\" due to it's totally original storyline. A poor,troubled Cajun youth from Louisiana bayou country, takes part in a sleep research experiment, for the purpose of dream analysis. Something goes inexplicably wrong, and he literally dreams to life a swamp creature inhabiting the dark folk tales of his youth. This malevolent manifestation seeks out all persons who have wronged the dreamer in his conscious state, and brutally suffocates them to death. Kolchak investigates and uncovers this horrible truth, much to the chagrin of police captain Joe \\"Mad Dog\\" Siska(wonderfully essayed by a grumpy Keenan Wynn)and the head sleep researcher played by Second City improv founder, Severn Darden, to droll, understated perfection. The wickedly funny, harrowing finale takes place in the Chicago sewer system, and is a series highlight. Kolchak never got any better. Timeless.
## 3: Despite the other comments listed here, this is probably the best Dirty Harry movie made; a film that reflects -- for better or worse -- the country's socio-political feelings during the Reagan glory years of the early '80's. It's also a kickass action movie.<br /><br />Opening with a liberal, female judge overturning a murder case due to lack of tangible evidence and then going straight into the coffee shop encounter with several unfortunate hoodlums (the scene which prompts the famous, \\"Go ahead, make my day\\" line), \\"Sudden Impact\\" is one non-stop roller coaster of an action film. The first time you get to catch your breath is when the troublesome Inspector Callahan is sent away to a nearby city to investigate the background of a murdered hood. It gets only better from there with an over-the-top group of grotesque thugs for Callahan to deal with along with a sherriff with a mysterious past. Superb direction and photography and a at-times hilarious script help make this film one of the best of the '80's.
## 4: I think this movie would be more enjoyable if everyone thought of it as a picture of colonial Africa in the 50's and 60's rather than as a story. Because there is no real story here. Just one vignette on top of another like little points of light that don't mean much until you have enough to paint a picture. The first time I saw Chocolat I didn't really \\"get it\\" until having thought about it for a few days. Then I realized there were lots of things to \\"get\\", including the end of colonialism which was but around the corner, just no plot. Anyway, it's one of my all-time favorite movies. The scene at the airport with the brief shower and beautiful music was sheer poetry. If you like \\"exciting\\" movies, don't watch this--you'll be bored to tears. But, for some of you..., you can thank me later for recommending it to you.
## 5: The film begins with promise, but lingers too long in a sepia world of distance and alienation. We are left hanging, but with nothing much else save languid shots of grave and pensive male faces to savour. Certainly no rope up the wall to help us climb over. It's a shame, because the concept is not without merit.<br /><br />We are left wondering why a loving couple - a father and son no less - should be so estranged from the real world that their own world is preferable when claustrophobic beyond all imagining. This loss of presence in the real world is, rather too obviously and unnecessarily, contrasted with the son having enlisted in the armed forces. Why not the circus, so we can at least appreciate some colour? We are left with a gnawing sense of loss, but sadly no enlightenment, which is bewildering given the film is apparently about some form of attainment not available to us all.
## 6: This is a film that had a lot to live down to . on the year of its release legendary film critic Barry Norman considered it the worst film of the year and I'd heard nothing but bad things about it especially a plot that was criticised for being too complicated <br /><br />To be honest the plot is something of a red herring and the film suffers even more when the word \\" plot \\" is used because as far as I can see there is no plot as such . There's something involving Russian gangsters , a character called Pete Thompson who's trying to get his wife Sarah pregnant , and an Irish bloke called Sean . How they all fit into something called a \\" plot \\" I'm not sure . It's difficult to explain the plots of Guy Ritchie films but if you watch any of his films I'm sure we can all agree that they all posses one no matter how complicated they may seem on first viewing . Likewise a James Bond film though the plots are stretched out with action scenes . You will have a serious problem believing RANCID ALUMINIUM has any type of central plot that can be cogently explained <br /><br />Taking a look at the cast list will ring enough warning bells as to what sort of film you'll be watching . Sadie Frost has appeared in some of the worst British films made in the last 15 years and she's doing nothing to become inconsistent . Steven Berkoff gives acting a bad name ( and he plays a character called Kant which sums up the wit of this movie ) while one of the supporting characters is played by a TV presenter presumably because no serious actress would be seen dead in this <br /><br />The only good thing I can say about this movie is that it's utterly forgettable . I saw it a few days ago and immediately after watching I was going to write a very long a critical review warning people what they are letting themselves in for by watching , but by now I've mainly forgotten why . But this doesn't alter the fact that I remember disliking this piece of crap immensely
The processing steps are:
prep_fun = tolower
tok_fun = word_tokenizer
#Create an iterator to pass to the create_vocabulary function
it_train = itoken(train$review,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = train$id,
progressbar = FALSE)
#Now create a vocabulary
vocab = create_vocabulary(it_train)
print(vocab)
## Number of docs: 4000
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
## terms terms_counts doc_counts
## 1: overturned 1 1
## 2: disintegration 1 1
## 3: vachon 1 1
## 4: interfered 1 1
## 5: michonoku 1 1
## ---
## 35592: penises 2 2
## 35593: arabian 1 1
## 35594: personal 102 94
## 35595: end 921 743
## 35596: address 10 10
An iterator is an object that traverses a container. A list is iterable. See: https://www.r-bloggers.com/iterators-in-r/
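As a sketch of the idea, a minimal hand-rolled iterator over a character vector looks like this (the names make_iter, next_val, and my_it are invented; text2vec's itoken() iterators add chunking and preprocessing on top):

```r
#A closure-based iterator: each call to next_val() returns the next element
make_iter = function(x) {
  i = 0
  list(next_val = function() {
    i <<- i + 1
    if (i > length(x)) NULL else x[i]
  })
}
my_it = make_iter(c("first doc", "second doc"))
print(my_it$next_val())  #"first doc"
print(my_it$next_val())  #"second doc"
print(my_it$next_val())  #NULL (exhausted)
```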
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
print(dim(as.matrix(dtm_train)))
## [1] 4000 35596
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.2.4
## Loading required package: foreach
## Loaded glmnet 2.0-5
NFOLDS = 4
res = cv.glmnet(x = dtm_train, y = train[['sentiment']],
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3,
maxit = 1e3)
plot(res)
it_test = test$review %>% prep_fun %>% tok_fun %>%
itoken(ids = test$id, progressbar = FALSE)
dtm_test = create_dtm(it_test, vectorizer)
preds = predict(res, dtm_test, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
## [1] 0.916697
n-grams are phrases formed by coupling consecutive words. For example, a bigram is a pair of consecutive words.
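Bigram construction amounts to pasting each token to its successor; a base-R sketch of what create_vocabulary(it_train, ngram = c(1, 2)) builds internally (the toy token vector is invented):

```r
tokens = c("this", "movie", "is", "great")
#Pair each token with the next one, joined by an underscore
bigrams = paste(head(tokens, -1), tail(tokens, -1), sep = "_")
print(bigrams)  #"this_movie" "movie_is" "is_great"
```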
vocab = create_vocabulary(it_train, ngram = c(1, 2))
print(vocab)
## Number of docs: 4000
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 2
## Vocabulary:
## terms terms_counts doc_counts
## 1: bad_characterization 1 1
## 2: few_step 1 1
## 3: also_took 1 1
## 4: in_graphics 1 1
## 5: like_poke 1 1
## ---
## 397499: original_uncut 1 1
## 397500: settle_his 2 2
## 397501: first_blood 2 1
## 397502: occasional_at 1 1
## 397503: the_brothers 14 14
This creates a vocabulary of both unigrams and bigrams. Notice how much larger it is than the unigram vocabulary from earlier. Because of this, we prune the vocabulary first, which speeds up computation.
vocab = vocab %>% prune_vocabulary(term_count_min = 10,
doc_proportion_max = 0.5)
print(vocab)
## Number of docs: 4000
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 2
## Vocabulary:
## terms terms_counts doc_counts
## 1: morvern 14 1
## 2: race_films 10 1
## 3: bazza 11 1
## 4: thunderbirds 10 1
## 5: mary_lou 21 1
## ---
## 17866: br_also 36 36
## 17867: a_better 96 89
## 17868: tourists 10 10
## 17869: in_each 14 14
## 17870: the_brothers 14 14
bigram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, bigram_vectorizer)
res = cv.glmnet(x = dtm_train, y = train[['sentiment']],
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3,
maxit = 1e3)
plot(res)
print(names(res))
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se"
#AUC (area under curve)
print(max(res$cvm))
## [1] 0.9217034
dtm_test = create_dtm(it_test, bigram_vectorizer)
preds = predict(res, dtm_test, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
## [1] 0.9268974
We discussed TF-IDF earlier; here we see how to implement it using the text2vec package.
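For intuition, TF-IDF can be hand-rolled on a tiny DTM as below. The toy matrix is invented, and the exact smoothing and normalization used by text2vec's TfIdf class may differ from this sketch:

```r
dtm_toy = matrix(c(2, 0, 1,
                   0, 3, 1), nrow = 2, byrow = TRUE,
                 dimnames = list(c("d1", "d2"), c("good", "bad", "movie")))
tf  = dtm_toy / rowSums(dtm_toy)                 #term frequency, normalized per document
idf = log(nrow(dtm_toy) / colSums(dtm_toy > 0))  #log inverse document frequency
tfidf_toy = sweep(tf, 2, idf, "*")               #scale each term column by its idf
print(round(tfidf_toy, 3))  #"movie" appears in every doc, so its column is 0
```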
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
dtm_test_tfidf = create_dtm(it_test, vectorizer) %>% transform(tfidf)
Now we take the TF-IDF adjusted DTM and run the classifier.
res = cv.glmnet(x = dtm_train_tfidf, y = train[['sentiment']],
family = 'binomial',
alpha = 1,
type.measure = "auc",
nfolds = NFOLDS,
thresh = 1e-3,
maxit = 1e3)
print(paste("max AUC =", round(max(res$cvm), 4)))
## [1] "max AUC = 0.913"
#Test on hold-out sample
preds = predict(res, dtm_test_tfidf, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
## [1] 0.8994684
From: http://stackoverflow.com/questions/39514941/preparing-word-embeddings-in-text2vec-r-package
We first create the TCM (term co-occurrence matrix) from start to finish.
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:qdap':
##
## %>%
library(text2vec)
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it) %>% prune_vocabulary(term_count_min=10)
vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 5)
tcm = create_tcm(it, vectorizer)
print(dim(tcm))
## [1] 7797 7797
Now fit the word embeddings using GloVe. See: http://nlp.stanford.edu/projects/glove/
model = GlobalVectors$new(word_vectors_size=50, vocabulary=v,
x_max=10, learning_rate=0.20)
model$fit(tcm, n_iter = 25)
## 2016-12-11 10:13:29 - epoch 1, expected cost 0.0822
## 2016-12-11 10:13:30 - epoch 2, expected cost 0.0504
## 2016-12-11 10:13:31 - epoch 3, expected cost 0.0431
## 2016-12-11 10:13:31 - epoch 4, expected cost 0.0388
## 2016-12-11 10:13:32 - epoch 5, expected cost 0.0359
## 2016-12-11 10:13:33 - epoch 6, expected cost 0.0336
## 2016-12-11 10:13:33 - epoch 7, expected cost 0.0320
## 2016-12-11 10:13:34 - epoch 8, expected cost 0.0306
## 2016-12-11 10:13:34 - epoch 9, expected cost 0.0297
## 2016-12-11 10:13:35 - epoch 10, expected cost 0.0287
## 2016-12-11 10:13:36 - epoch 11, expected cost 0.0280
## 2016-12-11 10:13:36 - epoch 12, expected cost 0.0274
## 2016-12-11 10:13:37 - epoch 13, expected cost 0.0269
## 2016-12-11 10:13:37 - epoch 14, expected cost 0.0264
## 2016-12-11 10:13:38 - epoch 15, expected cost 0.0260
## 2016-12-11 10:13:39 - epoch 16, expected cost 0.0256
## 2016-12-11 10:13:39 - epoch 17, expected cost 0.0253
## 2016-12-11 10:13:40 - epoch 18, expected cost 0.0250
## 2016-12-11 10:13:41 - epoch 19, expected cost 0.0248
## 2016-12-11 10:13:41 - epoch 20, expected cost 0.0246
## 2016-12-11 10:13:42 - epoch 21, expected cost 0.0244
## 2016-12-11 10:13:42 - epoch 22, expected cost 0.0241
## 2016-12-11 10:13:43 - epoch 23, expected cost 0.0240
## 2016-12-11 10:13:44 - epoch 24, expected cost 0.0238
## 2016-12-11 10:13:44 - epoch 25, expected cost 0.0237
wv = model$get_word_vectors() #Dimension: words x wvec_size
#Make distance matrix
d = dist2(wv, method="cosine") #Smaller values means closer
print(dim(d))
## [1] 7797 7797
#Pass: w=word, d=dist matrix, n=number of close words
findCloseWords = function(w,d,n) {
words = rownames(d)
i = which(words==w)
if (length(i) > 0) {
res = sort(d[i,])
print(as.matrix(res[2:(n+1)]))
}
else {
print("Word not in corpus.")
}
}
Example: show the ten words closest to “man” and to “woman”.
print(findCloseWords("man",d,10))
## [,1]
## woman 0.1307530
## young 0.2417085
## who 0.2716574
## girl 0.2761752
## guy 0.3217673
## person 0.3422519
## boy 0.3628652
## plays 0.3815644
## kid 0.4020192
## a 0.4031629
print(findCloseWords("woman",d,10))
## [,1]
## man 0.1307530
## young 0.1868513
## girl 0.2402866
## guy 0.3020979
## who 0.3086067
## boy 0.3364845
## named 0.3558772
## plays 0.3849196
## old 0.3954155
## lady 0.3958985
This is a very useful feature of word embeddings: words that are close to each other in the embedded space also tend to be semantically similar, even though closeness is computed simply from their co-occurrence frequencies.
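The cosine distance underlying dist2(wv, method="cosine") is simple to compute directly; a minimal sketch:

```r
#Cosine distance: 1 minus the cosine of the angle between two vectors
cosine_dist = function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
print(cosine_dist(c(1, 0), c(1, 0)))  #0 (same direction: maximally similar)
print(cosine_dist(c(1, 0), c(0, 1)))  #1 (orthogonal: unrelated)
```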
For more details, see: https://www.quora.com/How-does-word2vec-work
A geometrical interpretation: word2vec is a shallow word embedding model. This means that the model learns to map each discrete word id (0 through the number of words in the vocabulary) into a low-dimensional continuous vector-space from their distributional properties observed in some raw text corpus. Geometrically, one may interpret these vectors as tracing out points on the outside surface of a manifold in the “embedded space”. If we initialize these vectors from a spherical gaussian distribution, then you can imagine this manifold to look something like a hypersphere initially.
Let us focus on the CBOW for now. CBOW is trained to predict the target word t from the contextual words that surround it, c, i.e. the goal is to maximize P(t | c) over the training set. I am simplifying somewhat, but you can show that this probability is roughly inversely proportional to the distance between the current vectors assigned to t and to c. Since this model is trained in an online setting (one example at a time), at time T the goal is therefore to take a small step (mediated by the “learning rate”) in order to minimize the distance between the current vectors for t and c (and thereby increase the probability P(t | c)). By repeating this process over the entire training set, we have that vectors for words that habitually co-occur tend to be nudged closer together, and by gradually lowering the learning rate, this process converges towards some final state of the vectors.
By the Distributional Hypothesis (Firth, 1957; see also the Wikipedia page on Distributional semantics), words with similar distributional properties (i.e. that co-occur regularly) tend to share some aspect of semantic meaning. For example, we may find several sentences in the training set such as “citizens of X protested today” where X (the target word t) may be names of cities or countries that are semantically related.
You can therefore interpret each training step as deforming or morphing the initial manifold by nudging the vectors for some words somewhat closer together, and the result, after projecting down to two dimensions, is the familiar t-SNE visualizations where related words cluster together (e.g. Word representations for NLP).
For the skipgram, the direction of the prediction is simply inverted, i.e. now we try to predict P(citizens | X), P(of | X), etc. This turns out to learn finer-grained vectors when one trains over more data. The main reason is that the CBOW smooths over a lot of the distributional statistics by averaging over all context words while the skipgram does not. With little data, this “regularizing” effect of the CBOW turns out to be helpful, but since data is the ultimate regularizer the skipgram is able to extract more information when more data is available.
There’s a bit more going on behind the scenes, but hopefully this helps to give a useful geometrical intuition as to how these models work.
Uses Latent Dirichlet Allocation.
library(tm)
library(text2vec)
stopw = stopwords('en')
stopw = c(stopw,"br","t","s","m","ve","2","d","1")
#Make DTM
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it, stopwords = stopw) %>%
prune_vocabulary(term_count_min=5)
vectrzr = vocab_vectorizer(v, grow_dtm = TRUE, skip_grams_window = 5)
dtm = create_dtm(it, vectrzr)
print(dim(dtm))
## [1] 5000 12733
#Do LDA
lda = LatentDirichletAllocation$new(n_topics=5, v)
lda$fit(dtm,n_iter = 25)
doc_topics = lda$fit_transform(dtm,n_iter = 25)
print(dim(doc_topics))
## [1] 5000 5
#Get word vectors by topic
topic_wv = lda$get_word_vectors()
print(dim(topic_wv))
## [1] 12733 5
#Plot LDA
library(LDAvis)
lda$plot()
## Loading required namespace: servr
This produces a terrific interactive plot.
lsa = LatentSemanticAnalysis$new(n_topics = 5)
res = lsa$fit_transform(dtm)
print(dim(res))
## [1] 5000 5
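Under the hood, LSA is essentially a truncated SVD of the DTM; a base-R sketch on an invented toy matrix showing the same document-by-topic reduction that LatentSemanticAnalysis performs:

```r
set.seed(9)
dtm_toy = matrix(rpois(40, 2), nrow = 8)  #toy 8-document x 5-term matrix
k = 2                                     #number of latent topics to keep
s = svd(dtm_toy)
docs_lsa = s$u[, 1:k] %*% diag(s$d[1:k])  #documents projected into k-dim latent space
print(dim(docs_lsa))  #8 2
```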
Biblio at: http://srdas.github.io/Das_TextAnalyticsInFinance.pdf